Abstract

Query tractability has traditionally been defined as a function of
input database and query sizes, or of both input and output sizes,
where the query result is represented as a bag of tuples. In this report,
we introduce a framework that allows us to investigate tractability beyond
this setting. The key insight is that, although the cardinality of a query 
result can be exponential, its structure can be very regular and thus 
factorisable into a nested representation whose size is only polynomial 
in the size of both the input database and query. 

For a given query result, there may be several equivalent represen- 
tations, and we quantify the regularity of the result by its readability, 
which is the minimum over all its representations of the maximum 
number of occurrences of any tuple in that representation. We give a 
characterisation of select-project-join queries based on the bounds on 
readability of their results for any input database. We complement it 
with an algorithm that can find asymptotically optimal upper bounds 
and corresponding factorised representations. 

1 Introduction 

This paper studies properties related to the representation of results of 
select-project-join queries under bag semantics. In approaching this chal- 
lenge, we depart from the standard flat representation of query results as 
bags of tuples and consider nested representations of query results that can 
be exponentially more succinct than a mere enumeration of the result tuples. 
The relationship between a flat representation and a nested, or factorised, 
representation is on a par with the relationship between logic functions in 
disjunctive normal form and their equivalent nested forms obtained by
algebraic factorisation. When compared to flat representations of query results,
factorised representations are both succinct and informative. 



*A preliminary version has been submitted for publication on March 1, 2011. 










Cust                    Ord                             Item

      ckey  name              ckey  okey  date                okey  disc
c1    1     Joe         o1    1     1     1995          i1    1     0.1
c2    2     Dan         o2    1     2     1996          i2    1     0.2
c3    3     Li          o3    2     3     1994          i3    3     0.4
c4    4     Mo          o4    2     4     1993          i4    3     0.1
                        o5    3     5     1995          i5    4     0.4
                        o6    3     6     1996          i6    5     0.1

Figure 1: A TPC-H-like database. 



Example 1. Consider a simplified TPC-H scenario with customers, orders,
and discounted line items, as depicted in Figure 1. Each tuple is annotated
with an identifier. The query $Cust \bowtie_{ckey} Ord \bowtie_{okey} Item$ reports
all customers together with their orders and line items per order. A flat
representation of the result is presented below:
For each result tuple, the identifiers of tuples that contributed to it are
shown. For instance, the input tuples with identifiers $c_1$, $o_1$, and $i_1$
contribute to the first result tuple. Our factorised representation is based on an
algebraic factorisation of a polynomial that encodes the result. This encoding
is constructed as follows. Each result tuple is annotated with a product
of identifiers of tuples contributing to it. The whole result is then a sum of
such products. For this example, the sum of products of identifiers is:

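$\psi_1 = c_1 o_1 i_1 + c_1 o_1 i_2 + c_2 o_3 i_3 + c_2 o_3 i_4 + c_2 o_4 i_5 + c_3 o_5 i_6.$

An equivalent nested expression would be:

$\psi_2 = c_1 o_1 (i_1 + i_2) + c_2 (o_3 (i_3 + i_4) + o_4 i_5) + c_3 o_5 i_6.$

A factorised representation of the result is an extension of this nested
expression with values from the result tuples:

$c_1\langle 1, \text{Joe}\rangle\, o_1\langle 1, 1995\rangle\, (i_1\langle 0.1\rangle + i_2\langle 0.2\rangle) \; +$
$c_2\langle 2, \text{Dan}\rangle\, (o_3\langle 3, 1994\rangle\, (i_3\langle 0.4\rangle + i_4\langle 0.1\rangle) + o_4\langle 4, 1993\rangle\, i_5\langle 0.4\rangle) \; +$
$c_3\langle 3, \text{Li}\rangle\, o_5\langle 5, 1995\rangle\, i_6\langle 0.1\rangle.$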

To correctly interpret this representation as a relation, we also need a
mapping of identifiers to schemas. For instance, the identifiers $c_1$ to $c_3$
are mapped to (ckey, name), which serves as schema for tuples $\langle 1, \text{Joe}\rangle$,
$\langle 2, \text{Dan}\rangle$, and $\langle 3, \text{Li}\rangle$. □
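The construction of Example 1 can be replayed mechanically. The following is a minimal sketch, assuming the Figure 1 data is held in Python dictionaries keyed by tuple identifier and that a nested expression is encoded as tuples of the form `("+", ...)` or `("*", ...)` with identifiers as strings; this encoding and the helper names are illustrative choices, not part of the paper. It computes the monomials of $\psi_1$ by evaluating the join and checks that expanding $\psi_2$ gives the same monomials.

```python
from itertools import product

# Figure 1 data, keyed by tuple identifier.
cust = {"c1": (1, "Joe"), "c2": (2, "Dan"), "c3": (3, "Li"), "c4": (4, "Mo")}
ords = {"o1": (1, 1, 1995), "o2": (1, 2, 1996), "o3": (2, 3, 1994),
        "o4": (2, 4, 1993), "o5": (3, 5, 1995), "o6": (3, 6, 1996)}
item = {"i1": (1, 0.1), "i2": (1, 0.2), "i3": (3, 0.4),
        "i4": (3, 0.1), "i5": (4, 0.4), "i6": (5, 0.1)}


def flat_polynomial():
    """Monomials (identifier triples) of Cust join Ord join Item."""
    monomials = []
    for (c, ct), (o, ot), (i, it) in product(cust.items(), ords.items(), item.items()):
        if ct[0] == ot[0] and ot[1] == it[0]:  # equal ckey and equal okey
            monomials.append((c, o, i))
    return monomials


def expand(expr):
    """Expand a nested expression into a list of monomials."""
    if isinstance(expr, str):
        return [(expr,)]
    op, *args = expr
    if op == "+":
        return [m for a in args for m in expand(a)]
    result = [()]  # product: pick one monomial from every factor
    for a in args:
        result = [m + n for m in result for n in expand(a)]
    return result


psi1 = flat_polynomial()
psi2 = ("+",
        ("*", "c1", "o1", ("+", "i1", "i2")),
        ("*", "c2", ("+", ("*", "o3", ("+", "i3", "i4")), ("*", "o4", "i5"))),
        ("*", "c3", "o5", "i6"))

print(sorted(expand(psi2)) == sorted(psi1))
```

Only distributivity is used in `expand`, so the comparison reflects algebraic equivalence of the two polynomials rather than Boolean equivalence.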
We can easily recover the result tuples from the factorised representation
with polynomial delay, i.e., the delay between two successive tuples is
polynomial in the size of the representation. For this, consider the parse 
tree of the representation. The inner nodes stand for product or sum, and 
the leaves for identifiers with tuples. A result tuple is a concatenation of the 
tuples at the leaves after choosing one child for each sum and all children 
for each product. We assume here that from a user perspective, iterating 
over the result with small delay is more important than presenting the whole 
result at once. 
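The enumeration just described can be sketched as a lazy generator over the parse tree. The node encoding below (`"leaf"`, `"sum"`, `"prod"`) and the helper names are assumptions made for illustration, and the sketch does not attempt to reproduce the delay bounds stated later in the paper; it only shows the rule of choosing one child per sum and all children per product.

```python
def leaf(ident, *values):
    """A leaf: an identifier together with its tuple of values."""
    return ("leaf", ident, values)


def enumerate_tuples(node):
    """Yield (identifiers, values) for every tuple encoded by the node."""
    kind = node[0]
    if kind == "leaf":
        yield (node[1],), node[2]
    elif kind == "sum":  # choose exactly one child
        for child in node[1:]:
            yield from enumerate_tuples(child)
    else:  # "prod": combine all children
        yield from _product(node[1:])


def _product(children):
    """Concatenate one encoded tuple from each child, in order."""
    if not children:
        yield (), ()
        return
    for ids, vals in enumerate_tuples(children[0]):
        for rest_ids, rest_vals in _product(children[1:]):
            yield ids + rest_ids, vals + rest_vals


# The factorised representation of Example 1.
rep = ("sum",
       ("prod", leaf("c1", 1, "Joe"), leaf("o1", 1, 1995),
        ("sum", leaf("i1", 0.1), leaf("i2", 0.2))),
       ("prod", leaf("c2", 2, "Dan"),
        ("sum",
         ("prod", leaf("o3", 3, 1994), ("sum", leaf("i3", 0.4), leaf("i4", 0.1))),
         ("prod", leaf("o4", 4, 1993), leaf("i5", 0.4)))),
       ("prod", leaf("c3", 3, "Li"), leaf("o5", 5, 1995), leaf("i6", 0.1)))

results = list(enumerate_tuples(rep))
for ids, vals in results:
    print(ids, vals)
```

Because the generator is lazy, a consumer can stop after the first few tuples without materialising the whole result.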

Factorised representations can be more informative than flat representa- 
tions in that they better explain the result and spell out the extent to which 
certain input fields contribute to result tuples either individually or in groups 
with other fields. This enables a shift in the presentation of the result from 
a tuple-by-tuple view to a kernel view, in which commonalities across result 
tuples are made explicit by exploiting the factorised representation. We can 
depict it graphically as its parse tree or textually as a serialisation of this 
tree in tabular form. 

Example 2. The textual presentation of our factorised representation in
Example 1 could be the left one below:

It is easy to see that two discounted line items (with discount 0.1 and 0.2)
are for the same order 1 of customer Joe.

Consider now the following factorised representation

$(s_1\langle \text{Joe}\rangle + s_2\langle \text{Dan}\rangle + s_3\langle \text{Li}\rangle)(p_1\langle \text{LCD}\rangle + p_2\langle \text{LED}\rangle) + s_4\langle \text{Mo}\rangle\, p_3\langle \text{BW}\rangle$

where $s_1$ to $s_4$ identify suppliers, and $p_1$ to $p_3$ identify items. This representation
encodes that Joe, Dan, and Li supply both LCD and LED TV sets,
and Mo supplies BW TV sets. A textual presentation of this result could be
the right one above. The blocks between the horizontal lines encode tuples
obtained by combining any of the names with any of the items. This relational
product is suggested by the × symbol between the blocks. (We skip
the details on the mapping between the parse trees of factorised expressions
and their tabular presentations.) □

In the factorised representation $\psi_2$, and in contrast to its equivalent flat
representation $\psi_1$, each identifier occurs only once. We seek good factorised
representations of a query result in which each identifier occurs a small
number of times. The maximum number of occurrences of any identifier
in a representation, minimised over all of its equivalent representations, defines the
readability of that representation. Readability implies bounds on the representation
size. In our example, the size of the factorised representation is at
most linear in the size of the input database, since its readability is one.

Our study of readability is with respect to tuple identifiers and aligns well 
with query evaluation under bag semantics. This is different from readability 
with respect to values. For instance, $\psi_2$ has readability one, yet a value
may occur several times in the tuples of $\psi_2$, e.g., the discount value of 0.1.
Studying readability with respect to values is especially relevant to query 
evaluation under set semantics. 

2 Contributions 

The main contributions of this paper are as follows. 

• We introduce factorised representations, a succinct and complete rep- 
resentation system for (results of queries in) relational databases. In 
contrast to the standard tabular representation of a bag of tuples, 
factorised representations can be exponentially more succinct by fac- 
toring out commonalities across tuples. They also allow for an intuitive 
presentation, whereby commonalities across tuples are made explicit. 

• We give lower and upper bounds on the readability of basic queries 
with equality or inequality joins. 

The following holds for select-project-join queries with equality joins. 

• We introduce factorisation trees that define generic classes of factorised
representations for query results. Such trees are statically inferred
from the query and are independent of the database instance. A factorised
representation $\Phi(\mathcal{T})$ modelled on $\mathcal{T}$ has the nesting structure
of $\mathcal{T}$ for any input database.

• We give a tight characterisation of queries based on their readability
with respect to factorisation trees. For any query $Q$, we can find a
rational number $f(Q)$ such that the readability of $Q(D)$ is at most
$|Q| \cdot |D|^{f(Q)}$ for any database $D$, while for any factorisation tree $\mathcal{T}$
there exist databases for which the factorisation of $Q(D)$ modelled on
$\mathcal{T}$ has at least $(|D|/|Q|)^{f(Q)}$ occurrences of some identifier.

• For any query $Q$, we present an algorithm that iterates over the
factorisation trees of $Q$ and finds an optimal one $\mathcal{T}$. Given $\mathcal{T}$, we present
a second algorithm that computes in time $O(|Q| \cdot |D|^{f(Q)+1})$ for any
database $D$ a factorised representation $\Phi(\mathcal{T})$ of $Q(D)$ with readability
at most $|Q| \cdot |D|^{f(Q)}$ and at most $|D|^{f(Q)+1}$ occurrences of identifiers.

• Our characterisation captures as a special case the known class of
hierarchical non-repeating queries [DS07a] that have readability one [OH08].
We also show that non-hierarchical non-repeating queries have
readability $\Omega(\sqrt{|D|})$ for arbitrarily large databases $D$.




Section [10] shows how to extend the above results to selections that 
contain equalities with constants. Proofs are deferred to the appendix. 



3 Related Work

Our study has strong connections to work on readability of Boolean functions,
provenance and probabilistic databases, streamed query evaluation,
syntactic characterisations of queries with polynomial time combined complexity
or polynomial output size, and selectivity estimation in relational
engines. The present work is nevertheless unique in its use of succinct nested
representations of query results.

The notion of readability is borrowed from earlier work on Boolean functions,
e.g., [GPR06, GMR08, EMR09]. As in our case, a formula $\Phi$ is
read-m if each variable appears at most m times in $\Phi$, and the readability
of a formula or a function $\Phi$ is the smallest number m such that there is
a read-m formula equivalent to $\Phi$. Checking whether a monotone function
in disjunctive normal form has readability m = 1 can be done in time linear
in both the number of terms and number of variables [GMR08]. This
problem is open for m = 2, and already hard for m > 2 or for m = 2 and
monotone nested functions [EMR09]. This strand of work differs from ours
in two key points. Firstly, we only consider algebraic, and not Boolean,
equivalence; in particular, idempotence ($x \cdot x = x$) is not considered since
a reduction in the arity of any product in the representation would violate
the mapping between tuple fields and schemas. Secondly, we only consider
functions/formulas arising as results of queries, and classify queries based
on worst-case analysis of the readability of their results.

The hierarchical property [DS07a] of queries plays a central role in studies
with seemingly disparate focus, including the present one, probabilistic
databases, and streamed query evaluation. Our characterisation of query
readability essentially revolves around how far the query is from its hierarchical
subqueries. We show that, within the class of queries without repeating
relation symbols, the readability of any non-hierarchical query is dependent
on the size of the input database, while for any hierarchical query, the readability
is always one. This latter result draws on earlier work in the context
of probabilistic databases [OH08, OHK09, FO11], where read-once polynomials
over random variables are useful since their exact probability can be
computed in polynomial time. Read-m functions for m > 2 are of no use
in probabilistic databases, since probability computation for such functions
over random variables is #P-hard [Vad01]. In our case, however, readability
polynomial in the sizes of the input database and query is acceptable, since
it means that the size of the result representation is polynomial, too.

Mirroring the dichotomies in the probabilistic and query readability con- 
texts, it has been recently shown that the hierarchical property divides 
queries that can be evaluated in one pass from those that cannot in the finite
cursor machine model of computation [GGL+09]. In this model, queries
are evaluated by first sorting each relation, followed by one pass over each 
relation. It would be interesting to investigate the relationship between the 
readability of a query Q and the number of passes necessary in this model 
to evaluate Q. 

Our study fits naturally in the context of provenance management [GKT07].
Indeed, the polynomials over tuple identifiers discussed in Example 1 are
provenance polynomials and nested representations are algebraic factorisa- 
tions of such polynomials. In this sense, our work contributes a characteri- 
sation of queries by readability and size of their provenance polynomials. 

Earlier work in incomplete databases has introduced a representation 
system called world-set decompositions [OKA08] to represent succinctly sets
of possible worlds. Such decompositions can be seen as factorised represen- 
tations whose structure is a product of sums of products. 

There exist characterisations of conjunctive queries with polynomial time 
combined complexity [AHV95]. The bulk of such characterisations is for
various classes of Boolean queries under set semantics. In this context, even 
simple non-Boolean conjunctive queries such as a product of n relations 
would require evaluation time exponential in n. Our approach exposes the 
simplicity of this query, since its readability is one and the smallest factorised 
representation of its result has linear size only and can be computed in linear 
time. Factorised representations could thus lead to larger classes of tractable 
queries. 

Finally, there has been work on deriving bounds on the cardinality of 
query results in terms of structural properties of queries [GLS99, AGM08,
GLV09]. Our work uses the results in [AGM08] and quantifies how much
they can be improved due to factorised representations. 

4 Preliminaries 

Databases. We consider relational databases as collections of annotated
relation instances, as in Example 1. Each relation instance $\mathbf{R}$ is a bag of
tuples in which each tuple is annotated by an identifier. We denote by $\mathcal{I}(\mathbf{R})$
the set of identifiers in $\mathbf{R}$, by $\mathcal{S}(\mathbf{R})$ the schema of $\mathbf{R}$, and call the pair
$(\mathcal{I}(\mathbf{R}), \mathcal{S}(\mathbf{R}))$ its signature.

The size of a relation instance $\mathbf{R}$ is the number of tuples in $\mathbf{R}$, denoted
by $|\mathbf{R}|$. The number of distinct tuples in $\mathbf{R}$ is denoted by $\|\mathbf{R}\|$. The size
$|D|$ of a database $D$ is the total number of tuples in all relations of $D$.

Remark 1. For the purpose of analysing the complexity of our algorithms, 
we assume that the tuples in the input database are of constant size. In 
many scenarios, this is however not realistic since even the encodings of the 
tuple identifiers must have size at least logarithmic in $|D|$. If the maximal
size of a tuple in $D$ is $C(D)$, the time complexity increases by an additional
factor $C(D)$ or similar, depending on the exact computation model used. □

Queries. We consider conjunctive or select-project-join queries written in
relational algebra but with evaluation under bag semantics. Such queries
have the form $\pi_{\bar{A}}(\sigma_\varphi(R_1 \times \dots \times R_n))$, where $R_1, \dots, R_n$ are relations, $\varphi$ is
a conjunction of equalities of the form $A_1 = A_2$ with attributes $A_1$ and $A_2$,
and $\bar{A}$ is a list of attributes of relations $R_1$ to $R_n$. The size $|Q|$ of the query
$Q$ is the total number of relations and attributes in $Q$.

Let $Q = \pi_{\bar{A}}(\sigma_\varphi(R_1 \times \dots \times R_n))$ be a query and $D$ be a database containing
a relation instance $\mathbf{R}_i$ of the correct schema for each relation $R_i$ in $Q$. The
result $Q(D)$ of the query $Q$ on the database $D$ is a relation instance whose
tuples are exactly those $\pi_{\bar{A}}(t_1 \times \dots \times t_n)$ for which $t_i \in \mathbf{R}_i$ and $t_1 \times \dots \times t_n \models \varphi$.
The tuple $\pi_{\bar{A}}(t_1 \times \dots \times t_n)$ is annotated by $id_1 id_2 \cdots id_n$, where $id_i$ is the
identifier of $t_i$ in $\mathbf{R}_i$.

Every query can be brought into an equivalent form where all relations
as well as all their attributes are distinct. To recover the original query $Q_0$
from the rewritten one $Q$, we keep a function $\mu$ that maps the relations in $Q$
to relations in $Q_0$, and the attributes of $R$ in $Q$ to those of $\mu(R)$ in $Q_0$. For
technical reasons, we only consider the rewritten queries in the remainder of the text;
the mapping $\mu$ carries the information about different relation symbols
representing the same relation. If a query $Q$ has two relations with the same
image under $\mu$, then $Q$ is repeating; otherwise, $Q$ is non-repeating.

For any attribute $A$, let $A^*$ be its equivalence class, that is, the set of all
attributes that are transitively equal to $A$ in $\varphi$, and let $r(A)$ be the set of
relations that have attributes in $A^*$.

A query is hierarchical if for any two attributes $A$ and $B$, either $r(A) \subseteq r(B)$,
or $r(A) \supseteq r(B)$, or $r(A) \cap r(B) = \emptyset$.

Example 3. The query from Example 1 in the introduction is non-repeating
and not hierarchical.

Consider the relations $R$, $S$, and $T$ over schemas $\{A_R\}$, $\{A_S, B_S\}$, and
$\{B_T, U\}$ respectively. The query $\pi_{\bar{A}}[\sigma_{A_R = A_S, B_S = B_T}(R \times S \times T)]$ is not
hierarchical (independently of the set $\bar{A}$), since $r(A_S) \not\subseteq r(B_S)$ and $r(A_S) \not\supseteq r(B_S)$,
but $r(A_S) \cap r(B_S) = \{S\}$. The query $\pi_{\bar{A}}[\sigma_{A_R = A_S, B_S = B_T, A_R = U}(R \times S \times T)]$,
equivalent to $R(A), S(A, B), T(B, A)$, is hierarchical, since
$r(A_R) = r(A_S) = r(U) = \{R, S, T\} \supseteq r(B_S) = r(B_T) = \{S, T\}$. □

1 The original definition [DS07a] does not consider the output attributes $\bar{A}$ when checking
the hierarchical property.
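The hierarchical test is easy to mechanise. The sketch below assumes a query is given as a dictionary from relation names to attribute lists plus a list of attribute equalities; this input format and the function names are hypothetical. It computes the equivalence classes with a small union-find, derives the sets $r(A)$, and checks the three-way condition of the definition on the two queries of Example 3.

```python
def relation_sets(schemas, equalities):
    """Map each attribute class (by representative) to the relations it touches."""
    parent = {a: a for attrs in schemas.values() for a in attrs}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in equalities:
        parent[find(a)] = find(b)

    r = {}
    for rel, attrs in schemas.items():
        for a in attrs:
            r.setdefault(find(a), set()).add(rel)
    return r


def is_hierarchical(schemas, equalities):
    """True iff every two classes have nested or disjoint relation sets."""
    sets = list(relation_sets(schemas, equalities).values())
    for i, a in enumerate(sets):
        for b in sets[i + 1:]:
            if not (a <= b or a >= b or not (a & b)):
                return False
    return True


schemas = {"R": ["A_R"], "S": ["A_S", "B_S"], "T": ["B_T", "U"]}
q1 = [("A_R", "A_S"), ("B_S", "B_T")]
q2 = q1 + [("A_R", "U")]

print(is_hierarchical(schemas, q1), is_hierarchical(schemas, q2))
```

The check considers all attributes, in line with the definition used here rather than the variant mentioned in the footnote.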



5 Factorised Representations 

In this section we formalise the notion of factorised representations, their 
algebraic equivalence, and readability. We also give tight bounds on the 
readability of certain factorised representations that are used in the next 
sections to derive bounds on the readability of query results. 

Definition 1. A factorised representation, or f-representation for short, $\Phi$
over a set of signatures $Sign$ is

• $\Phi_1 + \dots + \Phi_n$, where $\Phi_1$ to $\Phi_n$ are f-representations over $Sign$, or

• $\Phi_1 \cdots \Phi_n$, where $\Phi_1$ to $\Phi_n$ are f-representations over $Sign_1$ to $Sign_n$,
respectively, and these signatures form a disjoint cover of $Sign$, or

• $id\langle t\rangle$, where $id \in \mathcal{I}_i$ and $t$ is a tuple over schema $\mathcal{S}_i$, and $Sign = \{(\mathcal{I}_i, \mathcal{S}_i)\}$.

The polynomial of $\Phi$ is $\Phi$ without tuples on identifiers. The size of (the
polynomial of) $\Phi$ is the total number of occurrences of identifiers in $\Phi$. □

Two examples of f-representations are given in Section 1. A relational
database can have several algebraically equivalent f-representations, in the 
sense that these f-representations represent the same tuples and polynomials. 
Syntactically, we define equivalence of f-representations as follows. 

Definition 2. Two f-representations are equivalent if one can be obtained 
from the other using distributivity of product over sum and commutativity 
of product and sum. □ 

Each f-representation has an equivalent flat f-representation, which is a
sum of products. A product $i_1\langle t_1\rangle \cdots i_n\langle t_n\rangle$ defines the tuple $\langle t_1 \circ \dots \circ t_n\rangle$
over schema $\bigcup_j \mathcal{S}_j$, which is a concatenation of tuples $\langle t_1\rangle$ to $\langle t_n\rangle$, and is
annotated by the product $i_1 \cdots i_n$.

Definition 3. The relation encoded by an f-representation $\Phi$ consists of all
tuples defined by the products in the flat f-representation equivalent to $\Phi$. □

Since flat f-representations are standard relational databases annotated 
with identifiers, it means that any relational database can be encoded as an 
f-representation. This property is called completeness. 

Proposition 1. Factorised representations form a complete representation 
system for relational data. 

In particular, this means that there are f-representations of the result of 
any query in a relational database. 
Definition 4. Let $Q = \pi_{\bar{A}}(\sigma_\varphi(R_1 \times \dots \times R_n))$ be a query, and $D$ be a
database. An f-representation $\Phi$ encodes the result $Q(D)$ if its equivalent flat
f-representation contains exactly those products $id_1\langle \pi_{\bar{A}}(t_1)\rangle \cdots id_n\langle \pi_{\bar{A}}(t_n)\rangle$
for which $\pi_{\bar{A}}(t_1 \times \dots \times t_n) \in Q(D)$, and $id_i$ is the identifier of $t_i$ for all $i$.

The signature set of $\Phi$ consists of the signatures $(\mathcal{I}_i, \mathcal{S}_i)$ for each query
relation $R_i$ such that $\mathcal{I}_i$ is the set of identifiers of the relation instance in
$D$ corresponding to $R_i$, and $\mathcal{S}_i$ is the schema of $R_i$ in $Q$ restricted to the
attributes in $\bar{A}$. □

Flat f-representations can be exponentially less succinct than equivalent 
nested f-representations, where the exponent is the size of the schema. 

Proposition 2. Any flat representation equivalent to the f-representation
$(x_1\alpha + y_1\beta) \cdots (x_n\alpha + y_n\beta)$ over the signatures $(\{x_1, \dots, x_n\}, A)$ and
$(\{y_1, \dots, y_n\}, B)$ has size $2^n$.
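The blow-up behind Proposition 2 can be observed directly by distributing the product. The sketch below works on identifiers only (the polynomial of the f-representation) and uses a hypothetical helper `flat_products`; it compares the $2n$ identifier occurrences of the nested form with the number of products in the flat form.

```python
from itertools import product


def flat_products(n):
    """All products obtained by distributing (x1 + y1) ... (xn + yn)."""
    factors = [(f"x{i}", f"y{i}") for i in range(1, n + 1)]
    return list(product(*factors))


for n in (1, 2, 3, 10):
    nested_size = 2 * n  # every identifier occurs exactly once
    print(n, nested_size, len(flat_products(n)))
```

Each factor contributes an independent binary choice, so the number of products doubles with every additional factor while the nested form grows by only two identifiers.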

In addition to completeness and succinctness, f-representations allow for 
efficient enumeration of their tuples. 

Proposition 3. The tuples of an f-representation $\Phi$ can be enumerated with
$O(|\Phi| \log |\Phi|)$ delay and space.

Besides the size, a key measure of succinctness of f-representations is 
their readability. We extend this notion to query results for any input
database in Section 7.

Definition 5. An f-representation $\Phi$ is read-k if the maximum number of
occurrences of any identifier in $\Phi$ is k. The readability of $\Phi$ is the smallest
number k such that there is a read-k f-representation equivalent to $\Phi$. □

Since the readability of $\Phi$ is the same as that of its polynomial, we will use
polynomials of f-representations when reasoning about their readability.

Example 4. In Example 1, the polynomial $\psi_1$ is read-3 and the polynomial
$\psi_2$ is read-1. They are equivalent and hence both have readability one. □
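The read-k values in Example 4 can be checked by counting identifier occurrences. The sketch below reuses a tuple encoding of expressions (`("+", ...)`, `("*", ...)`, strings as identifiers), which is an illustrative assumption. Note that `read_k` measures one given expression; readability is the minimum of this value over all equivalent expressions and is not computed here.

```python
from collections import Counter


def occurrences(expr):
    """Count how often each identifier occurs in a nested expression."""
    if isinstance(expr, str):
        return Counter([expr])
    total = Counter()
    for arg in expr[1:]:  # expr[0] is the operator
        total += occurrences(arg)
    return total


def read_k(expr):
    """Smallest k such that the given expression is read-k."""
    return max(occurrences(expr).values())


def size(expr):
    """Total number of identifier occurrences."""
    return sum(occurrences(expr).values())


psi1 = ("+",
        ("*", "c1", "o1", "i1"), ("*", "c1", "o1", "i2"),
        ("*", "c2", "o3", "i3"), ("*", "c2", "o3", "i4"),
        ("*", "c2", "o4", "i5"), ("*", "c3", "o5", "i6"))
psi2 = ("+",
        ("*", "c1", "o1", ("+", "i1", "i2")),
        ("*", "c2", ("+", ("*", "o3", ("+", "i3", "i4")), ("*", "o4", "i5"))),
        ("*", "c3", "o5", "i6"))

print(read_k(psi1), size(psi1))
print(read_k(psi2), size(psi2))
```

In $\psi_1$ the identifier $c_2$ is the one occurring three times, which is what makes the flat form read-3.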

Given the readability $\rho$ and the number $n$ of distinct identifiers of a
polynomial, we can immediately derive an upper bound $n\rho$ on its size. A
better upper bound can be obtained by taking into account the (possibly
different) number of occurrences of each identifier. However, for polynomials
of query results, the bound $n\rho$ is often dominated by the readability $\rho$.

In Section 7 we define classes of queries that admit polynomials of low
readability, such as constant readability. We next give examples of polyno- 
mials with readability depending polynomially on the number of identifiers. 

Lemma 1. The polynomial $p_N = \sum_{i,j=1}^{N} r_i s_{ij} t_j$ has readability $\frac{N}{2} + O(1)$.

Lemma 1 can be generalised as follows.
Figure 2: F-trees for the query in Example 5.



Theorem 1. The readability of the polynomial $p_{N,M} = \sum_{i=1}^{N} \sum_{j=1}^{M} r_i s_{ij} t_j$

If we drop the set of identifiers $s_{ij}$, the readability becomes one. However,
if we restrict the relationship between the remaining identifiers, the
readability increases again. 

Theorem 2. The readability of the polynomial $q_N = \sum_{i,j=1;\, i \neq j}^{N} r_i t_j$ is
$\Omega(\frac{\log N}{\log\log N})$ and $O(\log N)$.

The polynomials $p_{N,M}$ and $q_N$ are relevant here due to their connection
to queries: $p_{N,M}$ is the polynomial of the query $\sigma_\varphi(R \times S \times T)$, where
$\varphi := (A_R = A_S \wedge B_S = B_T)$ and the schemas of $R$, $S$, and $T$ are $\{A_R\}$,
$\{A_S, B_S\}$, and $\{B_T\}$ respectively, on the database where $R$, $S$ and $T$ are
full relations with $|R| = N$ and $|T| = M$. Also, $q_N$ is the polynomial of the
disequality query $\sigma_{A_R \neq B_T}(R \times T)$. If $i \neq j$ is replaced by $i < j$ in $q_N$, the
lower and upper bounds on the readability of this new polynomial $q'_N$ still hold,
and we obtain the result of an inequality query.

A lower bound of $\Omega(\sqrt{\log N / \log\log N})$ on the readability of $q'_N$ is already known
even in the case when Boolean factorisation is allowed [GPR06].
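To make a logarithmic upper bound for the disequality polynomial concrete, the sketch below builds one factorisation by recursive halving: the index range is split in two, the cross pairs are factored as two products of sums, and the construction recurses on each half. This is an illustrative construction chosen here and is not claimed to be the one used in the paper's proof; the identifier names `r1`, `t1`, ... and the tuple encoding of expressions are assumptions.

```python
from collections import Counter


def ids(prefix, lo, hi):
    """The sum prefix_lo + ... + prefix_(hi-1)."""
    return ("+",) + tuple(f"{prefix}{i}" for i in range(lo, hi))


def q_expr(lo, hi):
    """Nested expression for the sum of r_i t_j over i != j in [lo, hi)."""
    if hi - lo < 2:
        return None  # no pair with i != j
    mid = (lo + hi) // 2
    parts = [("*", ids("r", lo, mid), ids("t", mid, hi)),
             ("*", ids("r", mid, hi), ids("t", lo, mid))]
    parts += [s for s in (q_expr(lo, mid), q_expr(mid, hi)) if s is not None]
    return ("+",) + tuple(parts)


def expand(expr):
    """Expand a nested expression into its monomials."""
    if isinstance(expr, str):
        return [(expr,)]
    op, *args = expr
    if op == "+":
        return [m for a in args for m in expand(a)]
    result = [()]
    for a in args:
        result = [m + n for m in result for n in expand(a)]
    return result


def occurrences(expr):
    """Count identifier occurrences in a nested expression."""
    if isinstance(expr, str):
        return Counter([expr])
    total = Counter()
    for arg in expr[1:]:
        total += occurrences(arg)
    return total


N = 8
q8 = q_expr(1, N + 1)
pairs = sorted(expand(q8))
max_occ = max(occurrences(q8).values())
print(len(pairs), max_occ)
```

Every identifier occurs once per level of the recursion, so for $N$ a power of two this expression is read-$\log_2 N$, and each pair with $i \neq j$ is produced exactly once.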



6 Factorisation Trees 

We next introduce a generic class of factorised representations for query re- 
sults, constructed using so-called factorisation trees, whose nesting structure 
and readability properties can be described statically from the query only. 
We present an algorithm that, given a factorisation tree T of a query Q, 
and an input database D, computes a factorised representation of Q(D), 
whose nesting structure is that defined by $\mathcal{T}$. Factorisation trees are used
in Section 7 to obtain bounds on the readability of queries.

Definition 6. A factorisation tree (f-tree) for a query $Q$ is a rooted unordered
forest $\mathcal{T}$, where

• there is a one-to-one mapping between inner nodes in $\mathcal{T}$ and equivalence
classes of attributes of $Q$,

• there is a one-to-one mapping between leaf nodes in $\mathcal{T}$ and relations
in $Q$, and

• the attributes of each relation only appear in the ancestors of its leaf. □

Figure 3: The $\mathcal{T}$-factorisation of a query result $Q(D)$ is computed as $\Phi(\mathcal{T}) = [\mathcal{T}](\top)$,
where $\top$ is the constant true (an empty conjunction). For a relation
$R$ in $Q$, $\mathbf{R}$ is the corresponding relation instance in the input database $D$.

Example 5. Consider the relations $R$, $S$, $T$, and $U$ over schemas $\{A_R, B_R, C\}$,
$\{A_S, B_S, D\}$, $\{A_T, E_T\}$, and $\{E_U, F\}$ respectively, and the query
$Q = \sigma_\varphi(R \times S \times T \times U)$ with $\varphi = (A_R = A_S, A_R = A_T, B_R = B_S, E_T = E_U)$. Figure 2
depicts two f-trees for $Q$.

Consider now the query $Q' = \sigma_\varphi(R \times S \times T)$ with $\varphi = (A_R = A_S, A_R = A_T, B_R = B_S)$.
Figure 7 on page 20 shows two f-trees for $Q'$ as well as a
partial tree that cannot be extended to an f-tree since the attributes $A_S$ and
$D$ of $S$ lie in different branches. □
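The three conditions of Definition 6 can be checked mechanically. In the sketch below an inner node is a pair `(class_name, children)` and a leaf is a relation name; the shapes of the two f-trees for $Q$ are inferred from the abbreviated expressions given for them in Example 6, since Figure 2 is not reproduced here, and should be read as an assumption. The tree `bad` is a hypothetical counterexample in which $D$ is not an ancestor of the leaf $S$.

```python
def check_ftree(forest, rel_classes):
    """Check Definition 6 for a forest.

    forest: list of trees; inner node = (class_name, [children]), leaf = relation name.
    rel_classes: relation -> set of attribute classes containing its attributes.
    """
    seen_classes, seen_rels = [], []

    def visit(node, ancestors):
        if isinstance(node, str):  # leaf: its classes must all be ancestors
            seen_rels.append(node)
            return rel_classes[node] <= ancestors
        name, children = node
        seen_classes.append(name)
        return all([visit(c, ancestors | {name}) for c in children])

    ok = all([visit(tree, frozenset()) for tree in forest])
    all_classes = set().union(*rel_classes.values())
    return (ok
            and sorted(seen_rels) == sorted(rel_classes)       # one leaf per relation
            and sorted(seen_classes) == sorted(all_classes))   # one node per class


# Attribute classes of the query Q from Example 5.
rel_classes = {"R": {"A", "B", "C"}, "S": {"A", "B", "D"},
               "T": {"A", "E"}, "U": {"E", "F"}}

T1 = [("A", [("B", [("C", ["R"]), ("D", ["S"])]), ("E", ["T", ("F", ["U"])])])]
T2 = [("E", [("A", [("B", [("C", ["R"]), ("D", ["S"])]), "T"]), ("F", ["U"])])]
bad = [("A", [("B", [("C", ["R"]), "S"]), ("D", []), ("E", ["T", ("F", ["U"])])])]

print(check_ftree(T1, rel_classes), check_ftree(T2, rel_classes), check_ftree(bad, rel_classes))
```

The third condition is what forces all attribute classes of a relation onto a single root-to-leaf path.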

Each f-tree for $Q$ is a recipe for producing an f-representation of the
result $Q(D)$ for any database $D$. For a given query $Q$ and database $D$,
this f-representation is called the $\mathcal{T}$-factorisation of $Q(D)$ and is denoted
by $\Phi(\mathcal{T})$. Figure 3 gives a recursive function $[\cdot]$ that computes the
$\mathcal{T}$-factorisation of $Q(D)$. A more detailed implementation of this function,
including an analysis of its time and space complexity, is given in Section 9.

The function $[\cdot]$ recurses on the structure of $\mathcal{T}$. The parameter $\gamma$ is
a conjunction of equality conditions that are collected while traversing the
f-tree top-down. Initially, $\gamma$ is an empty conjunction $\top$. In case $\mathcal{T}$ is a
forest $\{\mathcal{T}_1, \dots, \mathcal{T}_n\}$, we return the f-representation defined by the product of
f-representations of each tree in $\mathcal{T}$. If $\mathcal{T}$ is a single tree $A^*(\mathcal{U})$ with root $A^*$ and
children $\mathcal{U}$, we return the f-representation of a sum over all possible domain
values $a$ of the attributes in $A^*$ of the f-representations of the children $\mathcal{U}$. To
compute these, for each possible value $a$ we simply recurse on $\mathcal{U}$, appending
to $\gamma$ the equality condition $A^* = a$. Finally, in case $\mathcal{T}$ is a leaf $R$, we return
a sum of f-representations for result tuples in $\mathbf{R}$, that is, only those tuples
that satisfy $\gamma$. (When evaluating the selection with $\gamma$ on $\mathbf{R}$, we only consider
the equalities on attributes of $R$.) In the f-representation we only include
attributes from $Q$'s projection list, along with the tuple identifier.

Figure 4: Database used in Example

The symbolic products and sums in Figure 3 are of course expanded
out to produce a valid f-representation. However, we will often keep the
sums symbolic, abbreviate $\sum_{a \in Dom_{A^*}}$ to $\sum_{A^*}$, and write $R$ instead of
$\sum_{t_i \in \sigma_\gamma(\mathbf{R})} id_i\langle \pi_{head(Q)}(t_i)\rangle$ for the expression generated by the leaves. The
condition $\gamma$ can be inferred from the position in the expression, so we can still
recover the original representation and write out the sums explicitly. Such
an abbreviated form is independent of the database $D$ and conveniently
reveals the structure of any $\mathcal{T}$-factorisation.

Example 6. Consider the query $Q$ from Example 5 and the f-trees from
Figure 2. For any database, the left f-tree yields

$\Phi(\mathcal{T}_1) = \sum_A [\sum_B (\sum_C R \sum_D S) \sum_E (T \sum_F U)]$,

while the right f-tree yields

$\Phi(\mathcal{T}_2) = \sum_E [\sum_A (\sum_B (\sum_C R \sum_D S) T) \sum_F U]$,

both in abbreviated form. A procedure to produce the explicit form of $\Phi(\mathcal{T}_1)$
is shown in Figure 5.

For the particular database $D$ given in Figure 4, the f-representations
$\Phi(\mathcal{T}_1)$ and $\Phi(\mathcal{T}_2)$ yield the polynomials

$P_1 = (r_{111}(s_{111} + s_{112}) + r_{122} s_{121})\, t_{12} (u_{21} + u_{22}) + r_{212} s_{211} (t_{21} u_{11} + t_{22}(u_{21} + u_{22}))$,
$P_2 = r_{212} s_{211} t_{21} u_{11} + ((r_{111}(s_{111} + s_{112}) + r_{122} s_{121})\, t_{12} + r_{212} s_{211} t_{22})(u_{21} + u_{22})$.

They are equivalent to each other and to the polynomial $P$ of the flat
f-representation of $Q(D)$,

$P = r_{111} s_{111} t_{12} u_{21} + r_{111} s_{111} t_{12} u_{22} + r_{111} s_{112} t_{12} u_{21} +$
$r_{111} s_{112} t_{12} u_{22} + r_{122} s_{121} t_{12} u_{21} + r_{122} s_{121} t_{12} u_{22} +$
$r_{212} s_{211} t_{21} u_{11} + r_{212} s_{211} t_{22} u_{21} + r_{212} s_{211} t_{22} u_{22}$.

Whereas $P$ is read-6, both $P_1$ and $P_2$ are read-2. □
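The recursive function described above can be sketched in a few lines and run on this example. Two assumptions are made and should be kept in mind: the database is inferred from the identifier subscripts in $P$ (for instance, $r_{abc}$ is taken to annotate the $R$-tuple $(a, b, c)$), and the shapes of the two f-trees are inferred from their abbreviated forms. The tuple encodings and function names are illustrative only, and the sketch works on polynomials (identifiers without values).

```python
from collections import Counter

# Assumed database: relation -> (attribute classes of its columns, {identifier: tuple}).
DB = {
    "R": (("A", "B", "C"), {"r111": (1, 1, 1), "r122": (1, 2, 2), "r212": (2, 1, 2)}),
    "S": (("A", "B", "D"), {"s111": (1, 1, 1), "s112": (1, 1, 2),
                            "s121": (1, 2, 1), "s211": (2, 1, 1)}),
    "T": (("A", "E"), {"t12": (1, 2), "t21": (2, 1), "t22": (2, 2)}),
    "U": (("E", "F"), {"u11": (1, 1), "u21": (2, 1), "u22": (2, 2)}),
}
# Assumed f-trees: inner node = (class, [children]); a string is a leaf.
T1 = [("A", [("B", [("C", ["R"]), ("D", ["S"])]), ("E", ["T", ("F", ["U"])])])]
T2 = [("E", [("A", [("B", [("C", ["R"]), ("D", ["S"])]), "T"]), ("F", ["U"])])]


def domain(cls):
    """Values taken by an attribute class anywhere in the database."""
    return sorted({t[cols.index(cls)] for cols, rows in DB.values()
                   if cls in cols for t in rows.values()})


def factorise(forest, gamma):
    """Product over the trees of a forest; None stands for the empty sum."""
    factors = []
    for node in forest:
        f = factorise_node(node, gamma)
        if f is None:  # one empty factor empties the whole product
            return None
        factors.append(f)
    return factors[0] if len(factors) == 1 else ("*", *factors)


def factorise_node(node, gamma):
    if isinstance(node, str):  # leaf: identifiers of tuples satisfying gamma
        cols, rows = DB[node]
        terms = [i for i, t in rows.items()
                 if all(gamma.get(c, v) == v for c, v in zip(cols, t))]
    else:  # inner node: sum over domain values, extending gamma
        cls, children = node
        terms = [f for a in domain(cls)
                 if (f := factorise(children, {**gamma, cls: a})) is not None]
    if not terms:
        return None
    return terms[0] if len(terms) == 1 else ("+", *terms)


def expand(expr):
    if isinstance(expr, str):
        return [(expr,)]
    op, *args = expr
    if op == "+":
        return [m for a in args for m in expand(a)]
    result = [()]
    for a in args:
        result = [m + n for m in result for n in expand(a)]
    return result


def occurrences(expr):
    if isinstance(expr, str):
        return Counter([expr])
    return sum((occurrences(a) for a in expr[1:]), Counter())


P1, P2 = factorise(T1, {}), factorise(T2, {})
print(max(occurrences(P1).values()), max(occurrences(P2).values()), len(expand(P1)))
```

Branches whose sums turn out empty are pruned, which is why values such as $E = 1$ under $A = 1$ leave no trace in the output.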
foreach value a ∈ Dom_A do output sum of
    foreach value b ∈ Dom_B do output sum of
        foreach value c ∈ Dom_C do output sum of identifiers of R-tuples (a, b, c)
        ×
        foreach value d ∈ Dom_D do output sum of identifiers of S-tuples (a, b, d)
    ×
    foreach value e ∈ Dom_E do output sum of
        output sum of identifiers of T-tuples (a, e)
        ×
        foreach value f ∈ Dom_F do output sum of identifiers of U-tuples (e, f)



Figure 5: A procedure for producing $\mathcal{T}_1$-factorisations in explicit form. The
abbreviated form is $\sum_A [\sum_B (\sum_C R \sum_D S) \sum_E (T \sum_F U)]$. $\mathcal{T}_1$ is the left
f-tree in Figure 2.

Remark 2. For any query Q, consider the f-tree T in which the nodes 
labelled by the attribute classes all lie on a single path, and the leaves 
labelled by the relations are all attached to the lowest node in that path. 
Such a tree T produces the T-factorisation in which we sum over all values 
of all attributes and for each combination of values we output the product 
over all relations of the sums of tuples which have the given values. If all 
the tuples in the input relations are distinct, the T-factorisation is just a 
sum of products, that is, the flat f-representation of the result. 

Thus, for a non-branching tree T we obtain a flat representation of Q(D). 
The more branching the tree T has, the more factorised the T-factorisation 
of Q(D) is. □ 

The correctness of our construction for a general query Q and database 
D is established by the following result. 

Proposition 4. For any f-tree $\mathcal{T}$ of a query $Q$ and any database $D$, $\Phi(\mathcal{T})$
is an f-representation of $Q(D)$.

We next introduce definitions concerning f-trees for later use. Consider
an f-tree $\mathcal{T}$ of a query $Q$. An inner node $A^*$ of $\mathcal{T}$ is relevant to a relation $R$ if
it contains an attribute of $R$. For a relation $R$, let Path($R$) be the set of inner
nodes appearing on the path from the leaf $R$ to its root in $\mathcal{T}$, Relevant($R$) $\subseteq$
Path($R$) be the set of nodes relevant to $R$, and Non-relevant($R$) = Path($R$) $\setminus$
Relevant($R$). For example, in the left f-tree of Figure 2, Non-relevant($R$) = $\emptyset$
and Non-relevant($U$) = $\{A_R^*\}$. In the right f-tree, Non-relevant($U$) = $\emptyset$, yet
Non-relevant($R$) = Non-relevant($S$) = $\{E_T^*\}$. In fact, there is no f-tree for
the query in Example 5 such that Non-relevant($R$) = $\emptyset$ for each relation $R$.
This is because the query is not hierarchical.

Proposition 5. A query is hierarchical iff it has an f-tree $\mathcal{T}$ such that
Non-relevant($R$) = $\emptyset$ for each relation $R$.

The left two trees shown in Figure 7 are f-trees of a hierarchical query.
The first f-tree satisfies the condition in Proposition 5, whereas the second
does not.

7 Readability of Query Results 

The readability of a query Q on a database D is the readability of any 
f-representation of Q(D), that is, the minimal possible k such that there 
exists a read-k representation of Q(D).

In this section we give upper bounds on the readability of arbitrary select- 
project-join queries with equality joins in terms of the cardinality |D| of the 
database D. We then show that these bounds are asymptotically tight with 
respect to statically chosen f-trees. By this we mean that for any query Q, if 
we choose an f-tree T, there exist arbitrarily large database instances D for 
which the T-factorisation of Q(D) is read-k with k asymptotically close to
our upper bound. In the next section we give algorithms to compute these 
bounds. We conclude the section with a dichotomy: In the class of non- 
repeating queries, hierarchical queries are the only queries whose readability 
for any database is 1 and hence independent of the size of the database. 

A key result for all subsequent estimates of readability is the following 
lemma that states the exact number of occurrences of any identifier of a 
tuple $\langle t \rangle$ in the T-factorisation of Q(D) as a function of the f-tree T, the
query $Q = \pi_{A}(\sigma_\varphi(R_1 \times \cdots \times R_n))$, and the database D.

Let $R = R_i$ be a relation of Q, let the condition $S(R) = \langle t \rangle$ denote the
conjunction of equalities of the attributes of R to corresponding values in
$\langle t \rangle$, and let NR = Non-relevant(R). In the T-factorisation of Q(D),
multiple occurrences of the same identifier from R arise from the summations
over the values of attributes from NR. Lemma [2] quantifies how many
different choices of such values in the summations thus yield a given
identifier from R. Recall that the projection attributes A do not influence the
cardinality of the query result and hence the number of occurrences of its
identifiers, since we consider bag semantics.

Lemma 2. The number of occurrences of the identifier r of a tuple $\langle t \rangle$ from
R in the T-factorisation of Q(D) is

$$\|(\pi_{NR}(\sigma_{S(R)=\langle t\rangle}\,\sigma_\varphi(R_1 \times \cdots \times R_n)))(D)\| .$$

For example, for the left f-tree in Figure [2], all identifiers in R, S, and T
occur once, whereas any identifier of U may occur as many times as there are
distinct A* values in R, S, and T. For the leftmost f-tree in Figure [7], all
identifiers in all relations occur once, since no relation has non-relevant nodes.

Lemma [2] is an effective tool for further estimating the readability
and size of T-factorisations. Our results build upon existing bounds
for query result sizes and yield readability bounds which can be inferred
statically from the query. Lemma [2] can potentially also be coupled with
estimates on selectivities and various assumptions on attribute-value
correlations [MD88, PI97, GTK01, RS10] to infer database-specific estimates on
the readability.

7.1 Upper Bounds 

Let D be a database, let $Q = \pi_{A}(\sigma_\varphi(R_1 \times \cdots \times R_n))$ be a query, let T be an
f-tree of Q, and let R be a relation in Q. Let NR = Non-relevant(R), and
denote by $\varphi_R$ the condition $\varphi$ restricted to the attributes of NR, by $Q_R$ the
query $\sigma_{\varphi_R}(\pi_{NR}R_1 \times \cdots \times \pi_{NR}R_n)$, and by $D_R$ the database obtained by
projecting each relation in D onto the attributes of NR.

Lemma 3. The number of occurrences of any identifier r from R in the
T-factorisation of Q(D) is at most $\|Q_R(D_R)\|$.

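Proof. By Lemma [2], the number of occurrences of r is equal to

$$\|(\pi_{NR}(\sigma_{S(R)=\langle t\rangle}\,\sigma_\varphi(R_1 \times \cdots \times R_n)))(D)\| ,$$

from which we obtain the desired bound by straightforward estimates:

$$\begin{aligned}
\|(\pi_{NR}(\sigma_{S(R)=\langle t\rangle}\,\sigma_\varphi(R_1 \times \cdots \times R_n)))(D)\|
&\le \|(\pi_{NR}(\sigma_\varphi(R_1 \times \cdots \times R_n)))(D)\| \\
&\le \|(\sigma_{\varphi_R}(\pi_{NR}(R_1 \times \cdots \times R_n)))(D)\| \\
&= \|Q_R(D_R)\| . \qquad \Box
\end{aligned}$$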

The number of distinct tuples in an equi-join query such as $Q_R$ can
be estimated in terms of the database size using the results in [AGM08].
Intuitively, if we can cover all attributes of the query $Q_R$ by some k of
its relations, then $\|Q_R(D_R)\|$ is at most the product of the sizes of these
relations, which is in turn at most $|D|^k$. This corresponds to an edge cover
of size k in the hypergraph of $Q_R$. The following result strengthens this idea
by lifting covers to a weighted version.

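Definition 7. For an equi-join query $Q = \sigma_\varphi(R_1 \times \cdots \times R_n)$, the fractional
edge cover number $\rho^*(Q)$ is the cost of an optimal solution to the linear
program with variables $\{x_i\}_{i=1}^{n}$,

$$\begin{aligned}
\text{minimising} \quad & \textstyle\sum_i x_i \\
\text{subject to} \quad & \textstyle\sum_{i : R_i \in r(A)} x_i \ge 1 \quad \text{for all attributes } A, \text{ and} \\
& x_i \ge 0 \quad \text{for all } i. \qquad \Box
\end{aligned}$$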

Lemma 4 ([AGM08]). For any equi-join query Q and for any database D,
we have $\|Q(D)\| \le |D|^{\rho^*(Q)}$.

Together with Lemma [3], this yields the following bound.

Corollary 1. The number of occurrences of any identifier r from R in the
T-factorisation of Q(D) is at most $|D|^{\rho^*(Q_R)}$.

Proof. By Lemma [3], the number of occurrences of r in the T-factorisation
of Q(D) is bounded above by $\|Q_R(D_R)\|$. By Lemma [4], this is bounded
above by $|D_R|^{\rho^*(Q_R)}$, which is equal to $|D|^{\rho^*(Q_R)}$. □

Corollary [I] gives an upper bound on the number of occurrences of iden- 
tifiers from each relation. Let M be the maximal number of relations which 
can contain the same identifier, that is, the maximal number of relations in 
Q mapping to the same relation name by p. Defining $f(T) = \max_R \rho^*(Q_R)$
to be the maximal possible $\rho^*(Q_R)$ over all relations R from Q, we obtain
an upper bound on the readability of the T-factorisation of Q(D). 

Corollary 2. The T-factorisation of Q(D) is at most read-$(M \cdot |D|^{f(T)})$.

By considering the T-factorisation with lowest readability, we obtain an 
upper bound on the readability of Q(D). Let $f(Q) = \min_T f(T)$ be the
minimal possible f(T) over all f-trees T for Q.

Corollary 3. For any query Q and any database D, the readability of Q(D)
is at most $M \cdot |D|^{f(Q)}$.

Since $M \le |Q|$, the readability of Q(D) is at most $|Q| \cdot |D|^{f(Q)}$.

Example 7. For the query Q in Example [5] and the left f-tree in Figure [2],
the relation U is the only one with a non-empty query
$Q_U = \sigma_{\varphi_U}(\pi_{A_R}R \times \pi_{A_S}S \times \pi_{A_T}T)$, where the condition $\varphi_U$ is
$A_R = A_S = A_T$. Since the other relations have empty covers (thus of cost
zero), we conclude that their identifiers occur at most once in the query
result. We can cover $Q_U$ with any non-empty subset of R, S, and T. A
minimal edge cover can be any of the relations, and the number of
occurrences of any identifier of U is thus linear in the size of that relation.
The fractional edge cover number is also 1 and we obtain the same bound.

For the right f-tree in Figure [2], both R and S have non-empty queries $Q_R$
and $Q_S$ defining their non-relevant sub-query of Q:
$Q_R = Q_S = \sigma_\varphi(\pi_{E_T}T \times \pi_{E_U}U)$, where $\varphi$ is $E_T = E_U$. The attributes
$E_T$ and $E_U$ can be covered by U or by T. A minimal cover thus has size 1.
The minimal fractional edge cover also has cost 1.

Figure 6: F-trees $T_1$ and $T_2$ for the query in Example [7].

Now consider a different query over the relations $R(A_R, E_R)$, $S(A_S, B_S, C_S)$,
$T(A_T, B_T, D_T)$ and $U(C_U, D_U, E_U)$, given by $Q = \sigma_\varphi(R \times S \times T \times U)$, with
$\varphi = (A_R = A_S = A_T,\ B_S = B_T,\ C_S = C_U,\ D_T = D_U,\ E_R = E_U)$.

Consider the left f-tree $T_1$ shown in Figure [6]. For the relation R, we
have Non-relevant(R) = {B*_S, C*_S, D*_T}, and hence the restricted query is
$Q_R = \sigma_{B_S=B_T,\,C_S=C_U,\,D_T=D_U}(\pi_{B_S,C_S}S \times \pi_{B_T,D_T}T \times \pi_{C_U,D_U}U)$. We
need at least two of the relations S, T, U to cover all attributes of $Q_R$, so
the edge cover number is 2. However, in the fractional edge cover linear
program, we can assign to each relation the value $x_S = x_T = x_U = 1/2$.
The covering conditions at each attribute are satisfied, since each attribute
belongs to two of the relations. The total cost of this solution is only 3/2.
It is in fact the optimal solution, so $\rho^*(Q_R) = 3/2$. It is easily seen that
$\rho^*(Q_T) = \rho^*(Q_U) = 1$ (since $Q_T$ can be covered by either S or U, and
$Q_U$ can be covered by either S or T) and $\rho^*(Q_S) = 0$ (since $Q_S$ has no
attributes), so $f(T_1) = 3/2$. We obtain the upper bound $|D|^{3/2}$ on the
number of occurrences of identifiers from R, and hence on the readability of
any $T_1$-factorisation.

Note however that in the right f-tree $T_2$ in Figure [6], each of $Q_R$, $Q_S$, $Q_T$
and $Q_U$ is covered by only one of its relations, and hence $f(T_2) = 1$. Any
$T_2$-factorisation will therefore have readability at most linear in |D|.

In fact, no f-tree T for Q has f(T) < 1, so $T_2$ is in this sense optimal
and f(Q) = 1. □
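The optimality of the cost 3/2 for the restricted query of R under the first f-tree can be checked without an LP solver: a feasible fractional edge cover and a feasible fractional independent set of equal value certify each other by weak LP duality. The Python sketch below is ours; the dictionary encoding, the function names and the class labels B, C, D are assumptions made for illustration.

```python
from fractions import Fraction

def is_fractional_edge_cover(relations, x):
    """x[R] >= 0 and every attribute class receives total weight >= 1."""
    classes = set().union(*relations.values())
    return (all(w >= 0 for w in x.values())
            and all(sum(x[r] for r, attrs in relations.items() if a in attrs) >= 1
                    for a in classes))

def is_fractional_independent_set(relations, y):
    """y[A] >= 0 and every relation carries total weight <= 1."""
    return (all(w >= 0 for w in y.values())
            and all(sum(y[a] for a in attrs) <= 1
                    for attrs in relations.values()))

half = Fraction(1, 2)
# Attribute classes of the restricted query: S has B and C, T has B and D,
# U has C and D.
q_r = {"S": {"B", "C"}, "T": {"B", "D"}, "U": {"C", "D"}}
cover = {"S": half, "T": half, "U": half}     # primal solution
indep = {"B": half, "C": half, "D": half}     # dual solution

cover_cost = sum(cover.values())
indep_value = sum(indep.values())
print(cover_cost, indep_value)   # equal values certify optimality of both
```

Any feasible cover costs at least as much as any feasible independent set is worth, so equal values on both sides mean that neither can be improved.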

7.2 Lower Bounds 

We also show that the obtained bounds on the numbers of occurrences of
identifiers are essentially tight. For any query Q and any f-tree T, we
construct arbitrarily large databases for which the number of occurrences of
some identifier is asymptotically as large as the upper bound.

The expression for the number of occurrences of an identifier, given in
Lemma [2], states the size of a specific query result. As a first attempt to
construct a small database D with a large result for the query $Q_R$, we pick
k attribute classes of $Q_R$ and let each of them attain N different values. If
each relation has attributes from at most one of these classes, each relation
in D will have size at most N, while the result of $Q_R$ will have size $N^k$.

This corresponds to an independent set of k nodes in the hypergraph of
$Q_R$. We can again strengthen this result by lifting independent sets to a
weighted version. Since the edge cover and the independent set problems
are dual when written as linear programming problems, this lower bound
meets the upper bound from the previous subsection. The following result,
derived from results in [AGM08], forms the basis of our argument.

Lemma 5. For any equi-join query Q, there exist arbitrarily large databases
D such that $\|Q(D)\| \ge (|D|/|Q|)^{\rho^*(Q)}$.

Now let $Q = \pi_{A}(\sigma_\varphi(R_1 \times \cdots \times R_n))$ be a query, let T be an f-tree of Q
and let R be a relation in Q. Define NR, $\varphi_R$ and $Q_R$ as before. We can
apply Lemma [5] to the expression from Lemma [2] to infer lower bounds for
numbers of occurrences of identifiers in the T-factorisation of Q(D).

Lemma 6. There exist arbitrarily large databases D such that each identifier
from R occurs in the T-factorisation of Q(D) at least $(|D|/|Q|)^{\rho^*(Q_R)}$ times.

We now lift the result of Lemma [6] from the identifiers from R to all 
identifiers in the T-factorisation of Q(D). 

Corollary 4. There exist arbitrarily large databases D such that the
T-factorisation of Q(D) is at least read-$(|D|/|Q|)^{f(T)}$.

Finally, by minimising over all f-trees T, we find a lower bound on read- 
ability with respect to statically chosen f-trees. 

Corollary 5. Let Q be a query. For any f-tree T of Q there exist arbitrarily
large databases D for which the T-factorisation of Q(D) is at least
read-$(|D|/|Q|)^{f(Q)}$.

Example 8. Let us continue Example [7]. For the left f-tree in Figure [2],
an independent set of attributes covering the relations R, S, and T of the
query $Q_U$ is {A*}. Since $Q_U$ only has one attribute, this is also the largest
independent set, and the fractional relaxation of the maximum independent
set problem also has optimal cost 1.

For the right f-tree in Figure [2], the situation is similar. A maximum
independent set of attributes covering the relations T and U of the queries
$Q_R$ and $Q_S$ is {E*} and has size 1.

The situation is more interesting for the second query Q of Example [7].
Recall that for the left f-tree $T_1$ in Figure [6],
$Q_R = \sigma_{B_S=B_T,\,C_S=C_U,\,D_T=D_U}(\pi_{B_S,C_S}S \times \pi_{B_T,D_T}T \times \pi_{C_U,D_U}U)$,
its attribute classes being NR = {B*_S, C*_S, D*_T}. The maximum
independent set for $Q_R$ has size 1, since any two of its attribute classes are
relevant to a common relation. However, the fractional relaxation of the
maximum independent set problem allows us to increase the optimal cost to
3/2. In this relaxation, we want to assign nonnegative rational values to
the attribute classes, so that the sum of values in each relation is at most
one. By assigning to each attribute class the value 1/2, the sum of values
in each relation is equal to one, and the total cost of this solution is 3/2.
This is used in the proof of Lemma [6] to construct databases D for which the
identifiers from R appear at least $(|D|/3)^{3/2}$ times in the $T_1$-factorisation of
Q(D), thus proving the upper bound from Example [7] asymptotically tight.

Since all f-trees T for Q have f(T) ≥ 1, the results in this subsection
show that for any such f-tree T we can find databases D for which the
readability of the T-factorisation of Q(D) is at least linear in |D|. □
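The weighted construction can be illustrated on the three-relation restricted query of the example. With weight 1/2 on each attribute class, each class takes m values, each relation holds all m² pairs over its two classes, and the join has m³ results, that is, (|D|/3)^{3/2}. The Python sketch below is ours and only reproduces this counting for one small m; the function names and the value m = 4 are arbitrary choices.

```python
from itertools import product

def triangle_database(m):
    """Each attribute class B, C, D takes m values; each relation is complete."""
    values = range(m)
    S = set(product(values, values))   # tuples (b, c)
    T = set(product(values, values))   # tuples (b, d)
    U = set(product(values, values))   # tuples (c, d)
    return S, T, U

def join_size(S, T, U):
    """Number of (b, c, d) with (b, c) in S, (b, d) in T and (c, d) in U."""
    return sum(1 for (b, c) in S for (b2, d) in T
               if b == b2 and (c, d) in U)

m = 4
S, T, U = triangle_database(m)
db_size = len(S) + len(T) + len(U)     # |D| = 3 * m**2
result = join_size(S, T, U)            # m**3
print(db_size, result)
```

Since the result has m³ tuples and each relation has m² tuples, the exponent 3/2 in the bound is attained exactly on these instances.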

7.3 Characterisation of Queries by Readability 

For a fixed query, the obtained upper and lower bounds meet asymptoti- 
cally. Thus our parameter f(Q) completely characterises queries by their 
readability with respect to statically chosen f-trees. 

Theorem 3. Fix a query Q. For any database D, the readability of Q(D) is
$O(|D|^{f(Q)})$, while for any f-tree T of Q, there exist arbitrarily large databases
D for which the T-factorisation of Q(D) is read-$\Omega(|D|^{f(Q)})$.

Theorem [3] subsumes the case of hierarchical queries. 

Corollary 6. Fix a query Q. If Q is hierarchical, the readability of Q(D) 
for any database D is bounded by a constant. If Q is non-hierarchical, for 
any f-tree T of Q there exist arbitrarily large databases D such that the 
T-factorisation of Q(D) is read-$\Omega(|D|)$.

For non-repeating queries, the following result extends the above di- 
chotomy to the case of readability irrespective of f-trees. 

Theorem 4. Fix a non-repeating query Q. If Q is hierarchical, then the 
readability of Q(D) is 1 for any database D. If Q is non-hierarchical, then 
there exist arbitrarily large databases D such that the readability of Q(D) is
$\Omega(\sqrt{|D|})$.

8 Algorithms for Query Characterisation 

Given a query Q, we show how to compute the parameter f(Q) characterising
the upper bound on readability. We give an algorithm that iterates over
all f-trees T of Q to find one with minimum f(T). We further prune the
space of possible f-trees to avoid suboptimal choices.

Figure 7: Left to right: Two f-trees and a tree which cannot be extended to
an f-tree, used in Example [9].

The following lemma facilitates the search for optimal f-trees. Intuitively,
since the parameter f(T) depends on the costs of fractional covers of $Q_R$ for
the relations R of Q, and since $Q_R$ is the restriction of Q to the attributes of
Non-relevant(R) = Path(R) \ Relevant(R), by shrinking the sets Path(R),
the fractional cover number of $Q_R$ and hence the parameter f(T) can only
decrease.

Lemma 7. If $T_1$ and $T_2$ are f-trees for a query Q, and Path(R) in $T_1$ is a
subset of Path(R) in $T_2$ for any relation R of Q, then $f(T_1) \le f(T_2)$.

In any f-tree T, each relation symbol R lies under its lowest relevant node
A*. By moving R upwards directly under A*, Path(R) can only shrink, and
by Lemma [7], f(T) can only decrease. Thus, when iterating over all possible
f-trees T to find one with lowest f(T), we can assume that the leaves are as
close as possible to the root, and it is enough to iterate over all the possible 
subtrees formed by the inner nodes of f-trees. We next denote by reduced 
f-trees the f-trees where the leaves are removed. The only condition for a 
rooted tree over the set of nodes labelled by the attribute classes of Q to be 
a reduced f-tree, is that for each relation R, no two nodes relevant to R lie 
in sibling subtrees. Call this condition C. 

Example 9. Consider the relations R, S and T over schemas $\{A_R, B_R, C\}$,
$\{A_S, B_S, D\}$ and $\{A_T, E_T\}$ respectively, and the query $Q = \sigma_\varphi(R \times S \times T)$
with $\varphi = (A_R = A_S,\ A_R = A_T,\ B_R = B_S)$. Figure [7] depicts three trees.
Without their leaves, the first two are reduced f-trees. The third tree is not
a reduced f-tree as it violates condition C: the nodes A*_S and D* lie in sibling
subtrees, yet they are both relevant to S. We cannot place the leaf S under
both of them. □

Any reduced f-tree is a rooted forest satisfying the condition C. Such a
forest can either be a single rooted tree, or a collection of rooted trees. In
the first case, the condition C on the whole tree rooted at A* is equivalent
to C on the collection of subtrees of A*. In the second case, the condition
C must hold in the individual subtrees, but in addition, for each relation
R, the set of its relevant nodes Relevant(R) can only intersect one of the
subtrees. This recursive characterisation of the condition C is used in the
iter algorithm in Figure [8] to enumerate all reduced f-trees of a query with
the set S of attribute classes.

Call a partition $P_1, \ldots, P_n$ good if for each relation R in Q, the nodes
relevant to R lie in at most one $P_i$.

iter(node set S)
    foreach A* ∈ S do                                               (1)
        foreach T ∈ iter(S \ {A*}) do
            output tree formed by root A* and child T
    foreach good partition $P_1, \ldots, P_n$ of S do                (2)
        foreach ($T_1, \ldots, T_n$) ∈ (iter($P_1$), ..., iter($P_n$)) do
            output $T_1 \cup \cdots \cup T_n$

Figure 8: Iterating over all reduced f-trees.
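The recursion of iter can be sketched in Python as follows. This is an illustration of ours rather than a definitive implementation: the encoding of a forest as a frozenset of (root, subforest) pairs, the explicit base case for the empty node set, the restriction of step (2) to partitions with at least two blocks, and all function names are assumptions. The same forest may be produced more than once, so the caller collects the outputs into a set. The example uses the attribute classes of Example [9], labelled A to E.

```python
from itertools import product

def set_partitions(items):
    """Yield every partition of the list `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition

def is_good(partition, relations_of):
    """Each relation has relevant nodes in at most one block."""
    relations = set().union(*relations_of.values())
    return all(
        sum(any(rel in relations_of[n] for n in block) for block in partition) <= 1
        for rel in relations)

def iter_trees(nodes, relations_of):
    """Yield reduced f-trees (as forests) over `nodes`, possibly with repeats."""
    nodes = sorted(nodes)
    if not nodes:                          # base case, implicit in the figure
        yield frozenset()
        return
    for root in nodes:                     # step (1): choose a root
        rest = [n for n in nodes if n != root]
        for child in iter_trees(rest, relations_of):
            yield frozenset({(root, child)})
    for partition in set_partitions(nodes):    # step (2): good partitions
        if len(partition) >= 2 and is_good(partition, relations_of):
            options = [list(iter_trees(block, relations_of)) for block in partition]
            for forests in product(*options):
                yield frozenset().union(*forests)

def leaf(n):
    return (n, frozenset())

# Attribute classes of Example [9] and the relations they are relevant to.
relations_of = {"A": {"R", "S", "T"}, "B": {"R", "S"},
                "C": {"R"}, "D": {"S"}, "E": {"T"}}
trees = set(iter_trees(relations_of.keys(), relations_of))

# Root A with children B (over C and D) and E.
first = frozenset({("A", frozenset({("B", frozenset({leaf("C"), leaf("D")})),
                                    leaf("E")}))})
print(first in trees)
```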

Example 10. Consider the query in Example [9]. When algorithm iter
chooses the root $\{A_R, A_S, A_T\}$ in step 1, in the next recursive call it can
split the remaining nodes into $P_1$ = {B*_R, C*, D*} and $P_2$ = {E*_T}, since
Relevant(R) and Relevant(S) only intersect $P_1$ and Relevant(T) only
intersects $P_2$. The first tree in Figure [7] is created like this. However, when we
choose $\{B_R, B_S\}$ in step 1, in the next recursive call there are no possible
partitions in step 2, since the node $\{A_R, A_S, A_T\}$ lies in all of Relevant(R),
Relevant(S), Relevant(T). The second tree in Figure [7] is created within this
call, while the third tree in Figure [7], which is not a valid reduced f-tree, is
never produced. □

However, some choices of the root in line (1) and some choices of parti- 
tioning in line (2) of iter are suboptimal. Firstly we have 

Lemma 8. Let T be an f-tree. For two nodes A* and B*, if r(B) ⊂ r(A)
and B* is an ancestor of A*, then by swapping them we do not violate the
condition C and do not increase f(T).

Thus, we do not need to consider trees with root B*. The second tree in 
Figure [7] is suboptimal, since B* is the root instead of A*. If r(B) = r(A), 
then A* and B* are interchangeable in any f-tree, and we need only consider 
one of them as the root. 

Secondly, in line (2) of iter, among all the good partitions, there always
exists a finest one. That is, there always exists a finest partition $P_1, \ldots, P_n$
of the attribute classes such that Relevant(R) only intersects one $P_j$ for each
relation R. We do not need to consider any coarser partitions in line (2): for
any such coarser partition, we could split one of its trees into two, while not
increasing Path(R) for any R and thus not increasing f(T) by Lemma [7].
Moreover, if n > 1, by a similar argument we do not need to execute line (1)
at all: increasing the fanout of a node is always better. These observations
lead to a pruned version of algorithm iter, given in Figure [9].

Define a partial order on the nodes of Q by A* > B* iff either r(A) ⊃ r(B), or
r(A) = r(B) and A* is lexicographically larger than B* (to break ties arbitrarily
among interchangeable nodes). Also, call a partition $P_1, \ldots, P_n$ good if for each
relation R in Q, the nodes relevant to R lie in at most one $P_i$.

iter-pruned(node set S)
    let $P_1, \ldots, P_n$ be the finest good partition of S
    if n = 1 then
        foreach >-maximal A* ∈ S do
            foreach T ∈ iter-pruned(S \ {A*}) do
                output tree formed by root A* with T as its child
    else
        foreach ($T_1, \ldots, T_n$) ∈ (iter-pruned($P_1$), ..., iter-pruned($P_n$)) do
            output $T_1 \cup \cdots \cup T_n$

Figure 9: Pruned algorithm iter-pruned.
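The two ingredients of iter-pruned, the finest good partition and the >-maximal nodes, are easy to compute. In the Python sketch below, which is ours, the finest good partition is obtained as the connected components of the graph linking two nodes whenever they are relevant to a common relation; the function names and the dictionary encoding are assumptions, and the data are the attribute classes of Example [9], labelled A to E.

```python
def finest_good_partition(nodes, relations_of):
    """Connected components of nodes linked by sharing a relevant relation."""
    unseen = set(nodes)
    blocks = []
    while unseen:
        start = min(unseen)
        block, frontier = {start}, [start]
        while frontier:
            n = frontier.pop()
            for other in unseen - block:
                if relations_of[n] & relations_of[other]:
                    block.add(other)
                    frontier.append(other)
        unseen -= block
        blocks.append(sorted(block))
    return blocks

def maximal_nodes(nodes, relations_of):
    """Nodes that are maximal in the order > used by iter-pruned."""
    def greater(a, b):
        ra, rb = relations_of[a], relations_of[b]
        return ra > rb or (ra == rb and a > b)   # strict superset, then name
    return [a for a in nodes if not any(greater(b, a) for b in nodes)]

relations_of = {"A": {"R", "S", "T"}, "B": {"R", "S"},
                "C": {"R"}, "D": {"S"}, "E": {"T"}}

print(finest_good_partition(["A", "B", "C", "D", "E"], relations_of))
print(maximal_nodes(["A", "B", "C", "D", "E"], relations_of))
print(finest_good_partition(["B", "C", "D", "E"], relations_of))
```

On the full node set the partition has a single block and the only maximal node is A, so A becomes the root; on the remaining nodes the partition splits off E, which matches the behaviour described for the first tree.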

For the query in Example [9], algorithm iter-pruned does not output
the second tree in Figure [7]. The node $\{B_R, B_S\}$ is not considered for the
root since $r(B_R) \subset r(A_R)$. In fact iter-pruned only produces the first tree
from Figure [7], and exhibits such behaviour for all hierarchical queries:

Proposition 6. For a hierarchical query Q, the algorithm iter-pruned has
exactly one choice at each recursive call, and outputs a single reduced f-tree
in polynomial time.

Using lazy evaluation, at any moment there are at most linearly many 
calls of iter or iter-pruned on the stack. Between two consecutive output 
trees, there are at most linearly many recursive calls. The following theorem 
summarises our results so far. 

Theorem 5. Given a query Q, the algorithms iter and iter-pruned enu- 
merate reduced f-trees of Q with polynomial delay and polynomial space. 
Algorithm iter enumerates all reduced f-trees, while iter-pruned only a 
subset of these, which contains one with optimal f(T). 

Both algorithms can enumerate exponentially many reduced f-trees.

Figure 10: A reduced f-tree for $Q_{12}$ with f(T) = 2, the lowest possible.

For each constructed reduced f-tree, we can easily add the leaves with 
relations under the lowest node from Relevant (R). For each such f-tree T, 
we need to compute f(T), the maximum of $\rho^*(Q_R)$ over all relations R in
Q, which can be done in polynomial time, or using the simplex algorithm 
for linear programming. 

Example 11. Consider relations $R_i$ over schemas $\{A_i, B_i\}$ for $1 \le i \le n$,
let $\varphi = \bigwedge_{i=1}^{n-1} (B_i = A_{i+1})$ and let $Q_n$ be the query
$\sigma_\varphi(R_1 \times \cdots \times R_n)$. This query is a chain of n − 1 joins.

A reduced f-tree T for $Q_{12}$ is shown in Figure [10]. In the corresponding
f-tree, the leaves labelled by all relations apart from $R_{10}$ hang from the
leaves of the reduced f-tree. For all relations $R_i$, the query $Q_{R_i}$ has at most
two attributes, so $\rho^*(Q_{R_i}) \le 2$. However, for $R_1$ (and most other $R_i$), each
of the four relations of the query $Q_{R_1}$ only has one of the two attributes,
so the fractional edge cover number of $Q_{R_1}$ is 2. It follows that f(T) = 2.
In fact, this is the lowest possible value, and $f(Q_{12}) = 2$. For arbitrary n,

$$f(Q_n) = \lfloor \log_2 n \rfloor - 1 .$$

This example shows that branching in f-trees is key to low readability. An
alternative yet naive approach is to choose a minimal set of attributes such
that when these attributes are set to values from their domain, $Q_n$ becomes
hierarchical. We can then sum over all possible domain values for each such
attribute, and for each combination we create a read-once f-representation.
In case of $Q_n$, a minimal set of such attributes has cardinality linear in the
size of $Q_n$. The corresponding f-tree would have $f(T) = \Omega(n)$,
which is exponentially worse than the optimal value. □

9 Algorithms for Computing T-factorisations of
Query Results

Figure [3] gives a high-level recipe for producing the T-factorisation of Q(D),
given an f-tree T of a query Q, and a database D. We present here a more
detailed implementation of this algorithm and analyse its performance.

A naive implementation of the factorisation algorithm, exactly mimicking
the definition from Figure [3], is given in Figure [11]. However, it contains
two obvious inefficiencies. In line (1), it is inefficient to explicitly iterate
over all values a of the attributes from A* which appear in the database
D, because for some of them, gen(U, $\gamma$ ∧ (A* = a)) necessarily produces an
empty f-representation. Also, in line (2), it is inefficient to search every time
through the entire relation R for tuples satisfying $\gamma$.

Let T be an f-tree for a query Q and let D be a database. The T-factorisation
of Q(D) is obtained by running gen(T, ⊤), where

gen(tree T, conjunctive condition $\gamma$)
    if T is a tree with root A* and children U then
        create f-representation S := an empty sum
        foreach value a of any attribute from A* in the database D do   (1)
            append gen(U, $\gamma$ ∧ (A* = a)) to S
        return S
    else if T is a collection of trees $T_1, \ldots, T_n$ then
        return gen($T_1$, $\gamma$) · ... · gen($T_n$, $\gamma$)
    else if T is a leaf R then
        create f-representation S := an empty sum
        foreach tuple $\langle t_j \rangle$ of R = R(D) satisfying $\gamma$ do          (2)
            append $\langle \pi_{head(Q)} t_j \rangle$ to S
        return S

Figure 11: Naive implementation of the factorisation algorithm.

We eliminate these inefficiencies in the implementation gen2 given in
Figure [12], by passing the ranges of tuples satisfying $\gamma$ for each relation of
D, instead of the condition $\gamma$. At any call of gen, the set of attributes
constrained by $\gamma$ forms a contiguous path ending at a root of the original
tree T. Therefore, if we sort the tuples of each relation first by the
highest-appearing attribute, then by the next-highest one, and so on, then in each
call of gen, the set of tuples satisfying $\gamma$ will form a contiguous range
in each relation. Thus in gen2, we only need to pass the pointers to the
beginning and end of each such range.

Moreover, if T in gen2(T, ranges) is a tree with root A*, for each relation
R with an attribute A ∈ A*, the tuples of R in the corresponding range will
be sorted by the attribute A. When iterating over values a of A* in line (1),
using a mergesort-like strategy we can find those values a which appear at
least once in the relevant range of each relation in r(A). For each such a
we also find the corresponding range of tuples in each relation and recurse.
For other a, i.e. those for which at least one relation with an attribute in
A* has no tuples in the current range with value a in that attribute, the
f-representation generated at A*'s children would be empty, and we do not
need to recurse.

Finally, if T is just a leaf R, the iteration in line (2) becomes trivial in
gen2: we simply iterate over the corresponding range in R.

Let T be an f-tree for a query Q and let D be a database. Order the attributes
of each relation by their appearance in T from highest to lowest, and then sort
the tuples of each relation, by higher attributes first. The T-factorisation of
Q(D) is obtained by running gen2(T, $\{(1, |R_i|)\}_{i=1}^{n}$), where

gen2(tree T, ranges $(\mathit{start}_i, \mathit{end}_i)$ for i = 1, ..., n)
    if T is a tree with root A* and children U then
        create f-representation S := an empty sum
        repeat
            find the next ranges $(\mathit{start}'_i, \mathit{end}'_i) \subseteq (\mathit{start}_i, \mathit{end}_i)$
                of tuples sharing the same value a on A*                 (1)
            append gen2(U, $\{(\mathit{start}'_i, \mathit{end}'_i)\}_{i=1}^{n}$) to S
        until no more such ranges exist
        return S
    else if T is a collection of trees $T_1, \ldots, T_k$ then
        return $\prod_{j=1}^{k}$ gen2($T_j$, $\{(\mathit{start}_i, \mathit{end}_i)\}_{i=1}^{n}$)
    else if T is a leaf $R_i$ then
        return $\sum_{j=\mathit{start}_i}^{\mathit{end}_i} \langle \pi_{head(Q)} t_j \rangle$                    (2)

Figure 12: Improved implementation of the factorisation algorithm.

Example 12. Consider the left f-tree T of Figure [2] and the database D
used in Example [6], also shown in Figure [13]. Let us examine the execution
of the call gen2(T, $\mathcal{R}$), where $\mathcal{R}$ represents the full range in each relation
of D. The root of T is the node $\{A_R, A_S, A_T\}$, relevant to the relations
R, S and T. The first execution of line (1) finds the ranges given in red in
Figure [13], with the common value of $A_R = A_S = A_T = 1$. Notice that U
does not have an attribute in the root node, so its range remains unchanged.

After these ranges are found in line (1), they are passed to a next call
of gen2 on the subtree formed by the children of $\{A_R, A_S, A_T\}$. When this
call returns, the next execution of line (1) finds the ranges in R, S and T
with $A_R = A_S = A_T = 2$, the range in U being again unchanged.

For further illustration, we list all recursively invoked calls gen2(S, ranges)
for which S is a subtree of T rooted at an internal node. We use indentation
to express the recursion of the calls. For brevity, we only give the root
of S instead of S, and we specify the ranges by giving the characterising
condition $\gamma$.

gen2(A*, ⊤) = $(r_{111}(s_{111} + s_{112}) + r_{122}s_{121})\,t_{12}(u_{21} + u_{22})$
              $+\ r_{212}s_{211}(t_{21}u_{11} + t_{22}(u_{21} + u_{22}))$
    gen2(B*, A* = 1) = $r_{111}(s_{111} + s_{112}) + r_{122}s_{121}$
        gen2(C, A* = 1 ∧ B* = 1) = $r_{111}$
        gen2(D, A* = 1 ∧ B* = 1) = $(s_{111} + s_{112})$
        gen2(C, A* = 1 ∧ B* = 2) = $r_{122}$
        gen2(D, A* = 1 ∧ B* = 2) = $s_{121}$
    gen2(E*, A* = 1) = $t_{12}(u_{21} + u_{22})$
        gen2(F, A* = 1 ∧ E* = 2) = $(u_{21} + u_{22})$
    gen2(B*, A* = 2) = $r_{212}s_{211}$
        gen2(C, A* = 2 ∧ B* = 1) = $r_{212}$
        gen2(D, A* = 2 ∧ B* = 1) = $s_{211}$
    gen2(E*, A* = 2) = $t_{21}u_{11} + t_{22}(u_{21} + u_{22})$
        gen2(F, A* = 2 ∧ E* = 1) = $t_{21}u_{11}$
        gen2(F, A* = 2 ∧ E* = 2) = $t_{22}(u_{21} + u_{22})$.

□

Figure 13: An f-tree T and a database D during the execution of gen2.
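The factorisation of this example can be reproduced with a small Python sketch of the naive algorithm gen, extended to skip values that would yield an empty product. Everything concrete here is an assumption of ours: the database is reconstructed from the identifiers appearing in the trace (an identifier such as r122 is read as the tuple (1, 2, 2)), the f-tree is encoded as nested tuples, and leaves emit identifiers directly. The sketch also expands the result back into monomials to compare it with a brute-force join.

```python
from itertools import product

# Assumed database: {relation: (schema over attribute classes, {id: tuple})}.
DB = {
    "R": (("A", "B", "C"), {"r111": (1, 1, 1), "r122": (1, 2, 2), "r212": (2, 1, 2)}),
    "S": (("A", "B", "D"), {"s111": (1, 1, 1), "s112": (1, 1, 2),
                            "s121": (1, 2, 1), "s211": (2, 1, 1)}),
    "T": (("A", "E"), {"t12": (1, 2), "t21": (2, 1), "t22": (2, 2)}),
    "U": (("E", "F"), {"u11": (1, 1), "u21": (2, 1), "u22": (2, 2)}),
}
# Assumed f-tree: ("node", class, children) or ("leaf", relation).
TREE = ("node", "A", [
    ("node", "B", [("node", "C", [("leaf", "R")]), ("node", "D", [("leaf", "S")])]),
    ("node", "E", [("leaf", "T"), ("node", "F", [("leaf", "U")])]),
])

def satisfies(schema, row, cond):
    return all(row[schema.index(c)] == v for c, v in cond.items() if c in schema)

def gen(tree, cond):
    """Return ("sum", terms); terms are identifiers or ("prod", factors)."""
    if tree[0] == "leaf":
        schema, rows = DB[tree[1]]
        return ("sum", [i for i, row in rows.items() if satisfies(schema, row, cond)])
    _, cls, children = tree
    values = sorted({row[schema.index(cls)]
                     for schema, rows in DB.values() if cls in schema
                     for row in rows.values()})
    terms = []
    for a in values:                                   # line (1)
        factors = [gen(child, {**cond, cls: a}) for child in children]
        if all(f[1] for f in factors):                 # drop empty products
            terms.append(("prod", factors))
    return ("sum", terms)

def expand(expr):
    """Flatten an expression into its monomials (tuples of identifiers)."""
    if isinstance(expr, str):
        return [(expr,)]
    kind, parts = expr
    expanded = [expand(p) for p in parts]
    if kind == "sum":
        return [m for e in expanded for m in e]
    return [sum(combo, ()) for combo in product(*expanded)]

def count(expr, ident):
    """Number of occurrences of `ident` in the expression."""
    if isinstance(expr, str):
        return int(expr == ident)
    return sum(count(p, ident) for p in expr[1])

def flat_join():
    return [(r, s, t, u)
            for r, (a, b, _) in DB["R"][1].items()
            for s, (a2, b2, _) in DB["S"][1].items()
            for t, (a3, e) in DB["T"][1].items()
            for u, (e2, _) in DB["U"][1].items()
            if a == a2 == a3 and b == b2 and e == e2]

result = gen(TREE, {})
print(len(expand(result)), count(result, "u21"), count(result, "r111"))
```

In agreement with Lemma [2], identifiers of R, S and T occur once, while an identifier of U occurs once per A* value it joins with.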

We next investigate the time complexity of gen2 as well as the size of
the produced f-representation.

The first observation is that all lines apart from line (1) take time linear
in the output size. Consider now any particular call of gen2 on a subtree
rooted at node A* and denote by P the path from the root of the f-tree
to the node A*. During the execution of the loop containing line (1), for
each relation $R_i \in r(A)$, the tuples in the range $(\mathit{start}_i, \mathit{end}_i)$ are sorted by
their attributes in P in the order they occur in P. Therefore the iteration
over all the maximal subranges $(\mathit{start}'_i, \mathit{end}'_i)$ sharing the same value of A*
can be done in a mergesort-like manner with a single simultaneous pass of
the pointers $\mathit{start}'_i$ and $\mathit{end}'_i$ through the corresponding ranges $(\mathit{start}_i, \mathit{end}_i)$.
Since we assume that the tuples are of constant size, the time taken by line
(1) is linear in the number of tuples in these ranges $(\mathit{start}_i, \mathit{end}_i)$, for those
i such that $R_i \in r(A)$. (For other $R_i$ we keep $(\mathit{start}'_i, \mathit{end}'_i) = (\mathit{start}_i, \mathit{end}_i)$.)
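The single simultaneous pass of line (1) can be sketched as follows. The Python function below is an illustration of ours, with assumed names and an assumed input format: each relevant relation contributes the sorted list of values of its A* attribute within its current range, and the function yields each value common to all lists together with the half-open subrange it occupies in each list. The sample lists are the A-columns of R, S and T in a database reconstructed from the identifiers of Example [12].

```python
def common_value_ranges(columns):
    """Yield (value, [(start, end), ...]) for values present in every column.

    `columns` is a list of sorted lists; the ranges are half-open index
    intervals into the corresponding list.
    """
    pos = [0] * len(columns)
    while all(p < len(col) for p, col in zip(pos, columns)):
        target = max(col[p] for p, col in zip(pos, columns))
        # Advance each pointer to its first entry that is >= target.
        for i, col in enumerate(columns):
            while pos[i] < len(col) and col[pos[i]] < target:
                pos[i] += 1
        if any(p >= len(col) for p, col in zip(pos, columns)):
            return
        if all(col[p] == target for p, col in zip(pos, columns)):
            ranges = []
            for i, col in enumerate(columns):
                end = pos[i]
                while end < len(col) and col[end] == target:
                    end += 1
                ranges.append((pos[i], end))
                pos[i] = end           # continue after this block of equal values
            yield target, ranges

a_columns = [[1, 1, 2], [1, 1, 1, 2], [1, 2, 2]]   # R, S and T
for value, ranges in common_value_ranges(a_columns):
    print(value, ranges)
```

Every pointer only moves forward, so the total work is linear in the combined length of the ranges, as used in the analysis above.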

Lemma 9. The time taken by line (1) of gen2 when computing the
T-factorisation of Q(D) is $O(|Q| \cdot |D|^{f(T)+1})$.

The time taken by the remaining lines is linear in the output size. 

Lemma 10. For any f-tree T of a query Q and any database D, the
T-factorisation of Q(D) has size at most $|D|^{f(T)+1}$.

Proof. By Corollary [1], for any relation R, each identifier r of a tuple from
R occurs at most $|D|^{\rho^*(Q_R)} \le |D|^{f(T)}$ times in the T-factorisation of Q(D).
There are at most |D| different identifiers in the T-factorisation, so the total
number of (occurrences of) identifiers is at most $|D|^{f(T)+1}$. □

Additionally, we need to sort the relations of D in the correct order before
executing gen2, which takes time $O(|D| \log |D|)$. Putting this together with
Lemma [9] and Lemma [10], we obtain a bound on the total running time of
our factorisation algorithm gen2.

Theorem 6. For any f-tree T of a query Q and any database D, the
algorithm gen2 computes the T-factorisation of Q(D) in time $O(|Q| \cdot |D| \log |D|)$
for hierarchical queries and f(T) = 0, and $O(|Q| \cdot |D|^{f(T)+1})$ otherwise.

There is a close parallel between our results and the results of [AGM08,
GM06]. They show that for a fixed query Q, the flat representation of Q(D)
has size $O(|D|^{\rho^*(Q)})$ and can be computed in time $O(|D|^{\rho^*(Q)+1})$ for any
database D, while in Lemma [10] and Theorem [6] we show that by allowing
factorised representations, we can find one of size $O(|D|^{f(Q)+1})$ in time
$O(|D|^{f(Q)+1})$.

The improvement in the exponent from $\rho^*(Q)$ to f(Q) is quantified by
passing from a fractional edge cover of the whole query to the fractional edge
covers of the individual non-relevant parts for each relation in an optimal
f-tree of Q. If Q admits an f-tree with a high degree of branching, this
improvement can be substantial. There are queries for which $\rho^*(Q) = \Theta(|Q|)$
while $f(Q) = O(1)$, the simplest example being a product query of n
relations. For such cases, the savings in both size of the representation and the
time needed to compute it are exponential in |Q|.
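For the product query, the gap is easy to see numerically. The Python sketch below is ours: it builds n hypothetical relations with N identifiers each (the identifier names are invented for illustration) and compares the number of identifier occurrences in the flat representation with that in the factorised form, a product of n sums.

```python
from itertools import product

def product_query_sizes(n, N):
    """Identifier occurrences in the flat and factorised forms of a product."""
    relations = [[f"r{i}_{j}" for j in range(N)] for i in range(n)]
    flat = list(product(*relations))                  # one monomial per result tuple
    flat_size = sum(len(m) for m in flat)             # n * N**n occurrences
    factorised_size = sum(len(r) for r in relations)  # (r0_0 + ...)(r1_0 + ...)...
    return flat_size, factorised_size

print(product_query_sizes(3, 4))
```

The flat size grows as n · Nⁿ while the factorised size stays at n · N, and the factorised form is read-once.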
10 Equalities with Constants



We can extend our results to select-project-join queries whose selections 
contain equalities with constants. In the following, we call such queries 
simply queries with constants. 

Consider any query $Q = \pi_A(\sigma_\varphi(R_1 \times \cdots \times R_n))$ where $\varphi$ contains
equalities with constants, and without loss of generality assume that $\varphi$ is
satisfiable. Denote by C the set of all attributes of Q which are equated in $\varphi$
to constants, either directly or transitively. Let $\varphi_C$ be the conjunction of
equalities from $\varphi$ which involve attributes from C and let $\varphi'$ be the
conjunction of equalities from $\varphi$ which do not involve attributes from C.
Then $\varphi = \varphi' \wedge \varphi_C$, and hence
$Q = \pi_A(\sigma_{\varphi'}(\sigma_{\varphi_C}(R_1 \times \cdots \times R_n)))$.

Define the query $Q' = \pi_A(\sigma_{\varphi'}(R_1 \times \cdots \times R_n))$. Then for any database
D, we have $Q(D) = Q'(\sigma_{\varphi_C}(D))$. Since Q' is now a select-project-join query
without constants, this enables us to describe the factorisation properties of
Q(D) using our existing results.

Let us first extend our main definitions to queries with constants. 

Definition 8. For any query Q with constants, T is called an f-tree for Q 
if it is an f-tree for Q' . □ 

Definition 9. The T-factorisation of Q(D) is defined to be the T-factorisation
of $Q'(\sigma_{\varphi_C}(D))$. □

It follows immediately from $Q'(\sigma_{\varphi_C}(D)) = Q(D)$ that this definition is
sound, i.e. that the T-factorisation of Q(D) is indeed an f-representation of
Q(D). Just as for queries without constants, we can now define $f(Q)$ to be
the minimum $f(T)$ over all f-trees T for Q. Equivalently, $f(Q) = f(Q')$.

Corollary 7 (Extends Corollary [3]). For any query Q with constants and
any database D, the readability of Q(D) is at most $M \cdot |D|^{f(Q)}$.

Corollary 8 (Extends Corollary [5]). Let Q be a query with constants. For
any f-tree T of Q there exist arbitrarily large databases D for which the
T-factorisation Q(D) is at least read-$(|D|/|Q|)^{f(T)}$.

The dichotomy between non-repeating queries of bounded and unbounded
readability extends to queries with constants with only a slight change.

Corollary 9 (Extends Theorem [3]). Let Q be a non-repeating query with
constants. If Q' is hierarchical, then the readability of Q(D) is 1 for any
database D. If Q' is non-hierarchical, then there exist arbitrarily large
databases D such that the readability of Q(D) is $\Omega(\sqrt{|D|})$.

Since the f-trees for Q are the same as for Q', to enumerate the f-trees
for Q and to find an optimal one and hence $f(Q)$, it suffices to compute Q'
from Q and use the existing algorithms from Section [8].
Finally, to compute the T-factorisation of Q(D), it is sufficient to compute
Q' from Q and $\sigma_{\varphi_C}(D)$ from D, and then to use existing algorithms
from Section [9] to compute the T-factorisation of $Q'(\sigma_{\varphi_C}(D))$. Computing
Q' takes time $O(|Q|^2)$ and computing $\sigma_{\varphi_C}(D)$ takes time $O(|Q| \cdot |D|)$.
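
The two preprocessing steps can be sketched as follows. This is a minimal
illustration, not the report's implementation: the function names, the
encoding of equalities as pairs, and the database layout (relation name
mapped to a schema and a list of tuples) are all assumptions made here. The
closure loop is deliberately naive, in line with the quadratic bound for
computing Q'.

```python
def constant_attributes(attr_eqs, const_eqs):
    """Map every attribute equated to a constant (directly or transitively)
    to that constant. Assumes the selection condition is satisfiable."""
    value = dict(const_eqs)
    changed = True
    while changed:                      # naive closure over the equalities
        changed = False
        for a, b in attr_eqs:
            for x, y in ((a, b), (b, a)):
                if x in value and y not in value:
                    value[y] = value[x]
                    changed = True
    return value


def split_selection(attr_eqs, const_eqs):
    """Return (constants for the set C, equalities of phi' avoiding C)."""
    value = constant_attributes(attr_eqs, const_eqs)
    phi_prime = [(a, b) for a, b in attr_eqs
                 if a not in value and b not in value]
    return value, phi_prime


def select_constants(db, value):
    """Apply the constant part of the selection to each relation.
    db maps a relation name to (schema, tuples)."""
    out = {}
    for name, (schema, tuples) in db.items():
        checks = [(i, value[a]) for i, a in enumerate(schema) if a in value]
        out[name] = (schema,
                     [t for t in tuples if all(t[i] == c for i, c in checks)])
    return out
```

Each tuple is inspected once per constrained attribute, so the filtering
pass is linear in the database for a fixed query.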

Corollary 10 (Extends Theorem [6]). For any f-tree T of a query Q with
constants and any database D, we can compute the T-factorisation of Q(D)
in time $O(|Q| \cdot |D| \log |D| + |Q|^2)$ for hierarchical queries and $f(T) = 0$, and
$O(|Q| \cdot |D|^{f(T)+1} + |Q|^2)$ otherwise.

11 Conclusion 

This work is the start of a research agenda on a new kind of representation 
systems and query evaluation techniques, where the logical model is that of 
relational databases yet the actual physical model is that of factorised repre- 
sentations. As a necessary first step, this paper classifies select-project-join 
queries based on their worst-case result size as factorised representations. 
We consider bag semantics for query evaluation here. We plan to further 
study the problems of query evaluation on factorised representations, de- 
signing a factorisation-aware storage manager, as well as approximations of 
queries with non-polynomial readability by lower and upper bound queries 
with polynomial readability. We also plan to develop a visualisation ap- 
proach of query results based on factorised representations. 
A Deferred Proofs



Proofs from Section [5] 
Proof of Proposition [3] 

We show that the tuples of an f-representation $\Phi$ can be enumerated with
$O(|\Phi| \log |\Phi|)$ delay and space.

Each tuple represented by the f-representation $\Phi$ corresponds to a monomial
of the polynomial of $\Phi$, and each such monomial consists of the identifiers
reached by recursively choosing one summand at each sum and all
factors at each product.

We can use pointers to keep track of the choice of summand at each
sum. In general, we may have $O(|\Phi|)$ sums, and need $O(\log |\Phi|)$ space per
pointer. Any choice of pointers corresponds to a monomial of $\Phi$ obtained
by recursively exploring $\Phi$, following the chosen summands and multiplying
together all the reached identifiers. This can be done in time $O(|\Phi| \log |\Phi|)$
by a simple depth-first search. Not all sums are reached by this process,
since some of them lie inside other summands which were not chosen. Call
such sums disabled, and call the reachable sums enabled.

Initially, the pointer at each sum is set to the first summand of the sum.
This choice of pointers defines the first monomial. Consider any order $\pi$ of
the sums that is consistent with their nesting in $\Phi$, i.e. such that outer sums
appear earlier in $\pi$ than inner sums. To advance to the next monomial, we
advance the pointer of the last enabled sum in $\pi$. Advancing the pointer
of a sum S consists of updating it to point to the next summand of S. In
case it already points to the last summand, we update it back to the first
summand, and recursively advance the last enabled sum preceding S in $\pi$.
If S is already the first enabled sum, we terminate.

Updating a pointer of a sum potentially disables and enables other sums,
but only sums appearing later in $\pi$. The above process proceeds backwards
in $\pi$, traversing the enabled sums until finding one which is not pointing at
its last summand. Therefore, when advancing to the next monomial, we can
first find the enabled sums using a depth-first search in time $O(|\Phi| \log |\Phi|)$,
and then use this information when advancing the pointers in time $O(|\Phi|)$.
Finally, finding the next monomial using the updated pointers takes time
$O(|\Phi| \log |\Phi|)$. The total delay between outputting two monomials is thus
$O(|\Phi| \log |\Phi|)$.
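
The correspondence between monomials and choices of summands can be
illustrated with a short recursive generator. This is a simplified sketch,
not the pointer-based procedure described above, and it makes no claim about
matching its delay bound. The nested-tuple encoding of an f-representation
is an assumption made for the example.

```python
def monomials(expr):
    """Yield the monomials (tuples of identifiers) of a factorised expression.

    expr is ("id", name), ("sum", [children]) or ("prod", [children]).
    """
    kind, body = expr
    if kind == "id":
        yield (body,)
    elif kind == "sum":
        for child in body:              # choose exactly one summand
            yield from monomials(child)
    else:                               # product: combine all factors
        if not body:
            yield ()
            return
        head, rest = body[0], ("prod", body[1:])
        for left in monomials(head):
            for right in monomials(rest):
                yield left + right
```

For the expression (r1 + r2)(s1 + s2), the generator produces the four
monomials r1 s1, r1 s2, r2 s1 and r2 s2 in that order.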

Proof of Lemma [1]

We show that the polynomial $p_N = \sum_{i,j=1}^{N} r_i s_{ij} t_j$ has readability $\frac{N}{2} + O(1)$.

We first show that $p_N$ has readability at least $\frac{N}{2}$. Let $\psi$ be any
polynomial equivalent to $p_N$ and consider its parse tree, where adjacent sum
nodes are aggregated into a single node. If we expand $\psi$ by distributivity
of product over sum, we must obtain the expression $p_N = \sum_{i,j=1}^{N} r_i s_{ij} t_j$.
Therefore, there must be exactly one occurrence of $s_{ij}$ in the parse tree, and
it can have at most two multiplications on its path to the root. If there are
two multiplications, $\psi$ is of the form

$((s_{ij} + \psi_1)(r_i + \psi_2) + \psi_3)(t_j + \psi_4) + \psi_5$ or
$((s_{ij} + \psi_1)(t_j + \psi_2) + \psi_3)(r_i + \psi_4) + \psi_5$,

but then necessarily, $\psi_1$, $\psi_2$ and $\psi_4$ are empty (because if any two of $r_i$, $s_{ij}$,
or $t_j$ appear in a monomial in the result, the monomial must be $r_i s_{ij} t_j$).
Similarly, if there is one multiplication, $\psi$ is of the form

$(s_{ij} + \psi_1)((r_i + \psi_2)(t_j + \psi_3) + \psi_4) + \psi_5$,

but then necessarily all of $\psi_1$, $\psi_2$, $\psi_3$ and $\psi_4$ are empty. In any case, $s_{ij}$
appears in one of the forms

$(s_{ij} r_i + \ldots) t_j + \ldots$ or $(s_{ij} t_j + \ldots) r_i + \ldots$

Therefore, each $s_{ij}$ appears directly in a binary product with an r- or
t-identifier. Since there are $N^2$ of the s-identifiers, and $2N$ different r- and
t-identifiers, at least one of the latter occurs at least $\frac{N}{2}$ times in the
expression $\psi$.

To complete the proof, it is enough to exhibit a read-$(\frac{N}{2} + O(1))$
factorisation of $p_N$. Defining $a_N$ and $b_N$ as

$a_N = \sum_{i=1}^{N} \sum_{j=0}^{\lfloor N/2 \rfloor - 1} r_i s_{i(i+j)} t_{i+j}$
$\;\;\;\; = \sum_{i=1}^{N} r_i \left( \sum_{j=0}^{\lfloor N/2 \rfloor - 1} s_{i(i+j)} t_{i+j} \right)$,   (A)

$b_N = \sum_{i=1}^{N} \sum_{j=\lfloor N/2 \rfloor}^{N-1} r_i s_{i(i+j)} t_{i+j}$
$\;\;\;\; = \sum_{i=1}^{N} \sum_{j=\lfloor N/2 \rfloor}^{N-1} r_{i-j} s_{(i-j)i} t_i$
$\;\;\;\; = \sum_{i=1}^{N} \left( \sum_{j=\lfloor N/2 \rfloor}^{N-1} r_{i-j} s_{(i-j)i} \right) t_i$,   (B)

where all indices are considered modulo N, we get $p_N = a_N + b_N$. Each $s_{ij}$
occurs once either in expression A or expression B. Each $r_i$ occurs once in
A and $\lceil N/2 \rceil$ times in B, and each $t_j$ occurs $\lfloor N/2 \rfloor$ times in A and once in B.
Thus, writing $p_N$ as the sum of expressions A and B, we get a read-$(\lceil N/2 \rceil + 1)$
factorisation.
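
The counting argument for expressions A and B can be checked mechanically
for small N. The sketch below is an illustration only: the function name
`check_factorisation` is hypothetical, and the upper summation limit N-1 in
expression B is the reading assumed here, chosen because it matches the
stated occurrence counts.

```python
from collections import Counter


def check_factorisation(N):
    """Expand expressions A and B (indices modulo N) and count occurrences.

    Returns (covers_p_N, max_occurrences) where covers_p_N is True when the
    expansion yields every monomial r_i s_ij t_j exactly once.
    """
    def wrap(x):
        return (x - 1) % N + 1                # map any integer index into 1..N

    half = N // 2
    monomials = Counter()
    occurrences = Counter()
    for i in range(1, N + 1):
        occurrences["r", i] += 1              # A: r_i * (sum of s_{i,i+j} t_{i+j})
        for j in range(half):
            k = wrap(i + j)
            monomials[i, k] += 1              # monomial r_i s_{ik} t_k
            occurrences["t", k] += 1
        occurrences["t", i] += 1              # B: (sum of r_{i-j} s_{i-j,i}) * t_i
        for j in range(half, N):
            k = wrap(i - j)
            monomials[k, i] += 1              # monomial r_k s_{ki} t_i
            occurrences["r", k] += 1
    expected = Counter((i, j) for i in range(1, N + 1) for j in range(1, N + 1))
    return monomials == expected, max(occurrences.values())
```

For every N tried, the expansion equals $p_N$ and the most frequent r- or
t-identifier occurs $\lceil N/2 \rceil + 1$ times.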

Proof of Theorem [2]

We show that the readability of the polynomial $q_N = \sum_{i,j=1;\, i \ne j}^{N} a_i b_j$ is
$\Omega\left(\frac{\log N}{\log \log N}\right)$ but $O(\log N)$.
We first prove the lower bound. Any factorisation of the polynomial
$q_N = \sum_{i,j=1;\, i \ne j}^{N} a_i b_j$ is of the form $\sum_i A_i B_i$, where each $A_i$ is a sum of
a-variables and each $B_i$ a sum of b-variables. Represent each monomial $a_i b_j$
as an edge in the bipartite graph $K_{N,N}$ and each sum of such monomials as a
union of the corresponding edges. Each product $A_i B_i$ then represents a complete
bipartite subgraph (called a biclique) of $K_{N,N}$, and the factorisation $\sum_i A_i B_i$
represents an edge-disjoint union of such bicliques.

If $\sum_i A_i B_i = q_N$, this union must be equal to the graph represented by
$q_N$, that is, the crown graph $G_N = \{(a_i, b_j) \mid i \ne j\} \subseteq K_{N,N}$. The number
of occurrences of a variable in the factorisation is the number of its bicliques
containing the corresponding vertex of $K_{N,N}$.

The readability of $q_N$, denote it $\rho_N$, is then the smallest k for which $G_N$
can be written as a union of bicliques in such a manner that each of its
vertices is included in at most k of these bicliques.

Let $M_k$ be the largest N for which $G_N$ can be written as a union of
bicliques in such a manner that each of its vertices is included in at most k of
these bicliques. Then $\rho_N$ is the smallest k for which $M_k \ge N$.

Lemma 11. $M_1 = 2$.

Proof. On one hand, $G_2 = K_{1,1} + K_{1,1}$ can be written as a vertex-disjoint
union of bicliques. On the other hand, $G_3$ clearly cannot be written as a
vertex-disjoint union of bicliques. □

Lemma 12. For $k > 1$, $M_k < k^2 M_{k-1}$.

Proof. First introduce some notation: if $A = \{a_{i_1}, a_{i_2}, \ldots, a_{i_k}\}$ is a set
of vertices of $G_N$ all coming from one partition, by $\bar{A}$ we will denote the
opposite set $\{b_{i_1}, b_{i_2}, \ldots, b_{i_k}\}$. Similarly we define $\bar{B}$ for a set B of vertices
from the other partition.

Let $k > 1$, $N = k^2 M_{k-1}$, and suppose that $G_N$ can be written as a union
of bicliques such that each vertex is contained in at most k of them.

Consider one such collection $\mathcal{C}$ of bicliques. The vertex $a_1$ is contained
in at most k bicliques $\{A_i \times B_i\}_i \subseteq \mathcal{C}$, and the vertex $b_1$ is contained in at
most k bicliques $\{A'_i \times B'_i\}_i \subseteq \mathcal{C}$. Since $\bigcup \mathcal{C} = G_N$, we must have

$\bigcup_i B_i = \{b_2, \ldots, b_N\}$ and $\bigcup_i A'_i = \{a_2, \ldots, a_N\}$.

Since $|\bigcup_i B_i| = N - 1 = k^2 M_{k-1} - 1$, and since $k > 1$, by the pigeonhole
principle there exists some $B_j$ such that $|B_j| > k M_{k-1}$. But $\bigcup_i A'_i =
\{a_2, \ldots, a_N\}$ implies $\bigcup_i \bar{A'_i} = \{b_2, \ldots, b_N\} \supseteq B_j$, so there exists some $A'_i$
such that $|\bar{A'_i} \cap B_j| > M_{k-1}$. Denote $\bar{A} = (\bar{A'_i} \cap B_j)$. This means that
$A \subseteq A'_i$ and $\bar{A} \subseteq B_j$. And $|A| > M_{k-1}$.

Now consider the collection $\mathcal{C}$ restricted to $A \times \bar{A}$, i.e.

$\mathcal{D} = \mathcal{C}|_{A \times \bar{A}} = \{(X \times Y) \cap (A \times \bar{A}) \mid X \times Y \in \mathcal{C}\}$.

This is still a collection of bicliques, and it covers the graph induced by
$A \times \bar{A}$, which is in fact isomorphic to $G_{|A|}$. Since $|A| > M_{k-1}$, at least one
vertex v of this subgraph is contained in at least k bicliques in $\mathcal{D}$. Since
all bicliques in $\mathcal{D}$ are restrictions of bicliques in $\mathcal{C}$, v is also included in the
corresponding bicliques from $\mathcal{C}$. However, v is also included in the biclique
$A_j \times B_j$ or $A'_i \times B'_i$ (depending on the partition it is in). But since $A_j \cap A = \emptyset$
and $\bar{A} \cap B'_i = \emptyset$, the restrictions of these two bicliques to $A \times \bar{A}$ are empty,
and thus neither of them is one of our original k bicliques containing v.
Therefore, v is in fact included in at least k + 1 bicliques from $\mathcal{C}$, which is a
contradiction to our assumption. □

Corollary 11. For $k \ge 1$, $M_k \le 2(k!)^2$.

Lemma 13. With $k = \frac{\log N}{2 \log \log N}$, we have $2(k!)^2 < N$ for large enough N.

Corollary 12. $\rho_N = \Omega\left(\frac{\log N}{\log \log N}\right)$.

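For the upper bound on readability, we prove the following lemma.

Lemma 14. For any $N > 1$, $\rho_N \le \rho_{\lceil N/2 \rceil} + 1$.

Proof. Write $q_N$ as

$q_N = \sum_{i,j=1;\, i \ne j}^{\lceil N/2 \rceil} a_i b_j + \sum_{i,j=\lceil N/2 \rceil + 1;\, i \ne j}^{N} a_i b_j$
$\;\;\;\; + \left(\sum_{i=1}^{\lceil N/2 \rceil} a_i\right)\left(\sum_{j=\lceil N/2 \rceil + 1}^{N} b_j\right)$
$\;\;\;\; + \left(\sum_{i=\lceil N/2 \rceil + 1}^{N} a_i\right)\left(\sum_{j=1}^{\lceil N/2 \rceil} b_j\right)$.

The first two sums are equivalent to $q_{\lceil N/2 \rceil}$ and $q_{\lfloor N/2 \rfloor}$ respectively, so they
are both equivalent to at most read-$\rho_{\lceil N/2 \rceil}$ expressions, but they contain
different variables. In the rest of the expression, each variable appears at most
once. Therefore, the whole expression is equivalent to a read-$(\rho_{\lceil N/2 \rceil} + 1)$
expression. This completes the proof. □

Corollary 13. By induction, $\rho_N = O(\log N)$.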
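
The recursive splitting behind Lemma 14 can be checked on small instances.
The sketch below is an illustration under assumptions made here: bicliques
are encoded as pairs of index lists, the function names are hypothetical,
and the recursion splits the index set into a left part of size
$\lceil N/2 \rceil$ and the remaining right part.

```python
from collections import Counter


def crown_bicliques(indices):
    """Bicliques (a-indices, b-indices) covering all pairs i != j exactly once."""
    n = len(indices)
    if n < 2:
        return []                        # a single index has no pair i != j
    half = (n + 1) // 2                  # left part has ceil(n/2) indices
    left, right = indices[:half], indices[half:]
    return (crown_bicliques(left) + crown_bicliques(right)
            + [(left, right), (right, left)])


def readability_bound(N):
    """Return (covers_crown_graph, max occurrences of any variable)."""
    bicliques = crown_bicliques(list(range(1, N + 1)))
    edges = Counter((i, j) for A, B in bicliques for i in A for j in B)
    expected = Counter((i, j) for i in range(1, N + 1)
                       for j in range(1, N + 1) if i != j)
    occ = Counter()
    for A, B in bicliques:
        for i in A:
            occ["a", i] += 1
        for j in B:
            occ["b", j] += 1
    return edges == expected, max(occ.values(), default=0)
```

In this construction every variable occurs at most $\lceil \log_2 N \rceil$
times, in line with the logarithmic upper bound.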



Proofs from Section [6] 
Proof of Proposition [4]

We show that for any f-tree T of a query Q, $\Phi(T)$ is an f-representation of
Q(D) for any database D.

To convert $\Phi(T)$ into a sum-of-products form, we repeatedly choose any
sum $\sum_{A^*}$ appearing inside a product and distribute all the other factors to
each of the summands. However, for each attribute class $A^*$, all relations
with any attribute from $A^*$ must appear as leaves of the subtree rooted at
$A^*$, and hence all tuples from these relations must already appear inside
the sum $\sum_{A^*}$. Therefore, when moving factors into a sum $\sum_{A^*}$, we can
also extend the conditions $A^* = a$ to these factors, as it will not affect the
selections on relations contained in them. We can then move all the products
downwards, obtaining an expression

(R) (7Thead(Q) (tj)), (1) 

A* R 

where the sums are over all equivalence classes of attributes and the product 
over all relations of Q. This is equivalent to the sum-of-products represen- 
tation of Q(D). 

Proof of Proposition [5] 

We show that a query is hierarchical if and only if it has an f-tree T such
that Non-relevant(R) = $\emptyset$ for each relation R.

Let Q be a hierarchical query. By Proposition [6], when computing the
f-tree T, the algorithm iter-pruned only has a single choice for the root of
each subtree. This means that for each node $A^*$ in the tree, all its children
$B^*$ satisfy $r(A) \supseteq r(B)$. Therefore, the nodes relevant to each relation R
not only lie on a path from the root of T, but form a contiguous path from
the root of T. The leaf labelled by R is put directly under the lowest node
of this path, and we get Non-relevant(R) = $\emptyset$.

Conversely, suppose that T is an f-tree for Q such that Non-relevant(R) = $\emptyset$
for each relation R. For any two attribute classes $A^*$ and $B^*$ of Q, either
one is an ancestor of the other, or they appear in sibling subtrees. In the
latter case, $r(A)$ and $r(B)$ are disjoint. In the former case, suppose wlog
that $A^*$ is an ancestor of $B^*$. Any relation $R \in r(B)$ must appear in a leaf
under the node $B^*$. However, since Non-relevant(R) = $\emptyset$, all nodes on the
path from R to its root are relevant to R, and we must also have $R \in r(A)$.
This shows that $r(B) \subseteq r(A)$ and completes the proof.

Proofs from Section [7] 
Proof of Lemma [2] 

Let $Q = \pi_A \sigma_\varphi(R_1 \times \cdots \times R_n)$ be a query, T be an f-tree of Q, and $\Phi(T)$
be the T-factorisation of Q(D). Also let $R = R_i$ be a relation of Q. By
NR we denote the set Non-relevant(R) and by $S(R) = \langle t \rangle$ we denote the
conjunction of equalities of all attributes of R to corresponding values in $\langle t \rangle$.
Lemma [2] claims that for any database D, the number of occurrences of the
identifier r of a tuple $\langle t \rangle$ from R in $\Phi(T)$ is equal to the number of distinct
tuples in

$(\pi_{NR}(\sigma_{S(R)=\langle t \rangle} \sigma_\varphi(R_1 \times \cdots \times R_n)))(D)$.
In $\Phi(T)$, each time an expression $[\![\text{leaf } R]\!](\gamma)$ is generated from the leaf
R, it appears inside the summations $\sum_{A^*}$ over all the values of attribute
classes $A^*$ from Path(R). Thus, each time an identifier r of a tuple $\langle t \rangle$ from
R appears in $\Phi(T)$, it appears inside a $[\![\text{leaf } R]\!](\gamma)$ with a different condition
$\gamma$ on the attributes from Path(R).

However, not all $\gamma$ yield the identifier r in the expression $[\![\text{leaf } R]\!](\gamma)$.
Firstly, all the attributes in the nodes relevant to R must be assigned the
corresponding value from $\langle t \rangle$ in the condition $\gamma$, otherwise the expression
will not contain the identifier r.

Secondly, even if the expression $[\![\text{leaf } R]\!](\gamma)$ contains r, it may happen
that this expression is inside a product with an empty sum, and hence does
not appear in the output $\Phi(T)$. In particular, r from $[\![\text{leaf } R]\!](\gamma)$ appears
in $\Phi(T)$ if and only if it appears in at least one monomial in the
sum-of-products form of $\Phi(T)$. From the expanded form of $\Phi(T)$ given in the
expression (1) in the proof of Proposition [4] we see that each such monomial
corresponds to an extension $\gamma'$ of the condition $\gamma$ to all attribute classes, for
which all other relations also give a nonempty selection.

Thus, each occurrence of r in $\Phi(T)$ corresponds to a condition $\gamma$ on the
attributes from NR, such that $\langle t \rangle$ satisfies $\gamma$ and there exists an output tuple
of Q(D) satisfying the condition $\gamma$. Each such condition $\gamma$ is determined by
the choice of values of the attributes from NR, and each such choice of
values corresponds to a tuple in

$(\pi_{NR}(\sigma_{S(R)=\langle t \rangle} \sigma_\varphi(R_1 \times \cdots \times R_n)))(D)$.

Proof of Lemma [5] 

We show that for any equi-join query Q, there exist arbitrarily large databases
D such that $\|Q(D)\| \ge (|D|/|Q|)^{\rho^*(Q)}$. We essentially repeat the proof given
in [AGM08], fixing a minor omission and extending it to repeating queries.

Suppose first that Q is non-repeating. Denote by $a(R)$ the set of attribute
classes containing attributes of a relation R. The linear program
with variables $y_{A^*}$ labelled by the attribute classes of Q,

maximising $\sum_{A^*} y_{A^*}$
subject to $\sum_{A^* \in a(R)} y_{A^*} \le 1$ for all relations R, and
$y_{A^*} \ge 0$ for all $A^*$,

is dual to the program given in Definition [7].

By this duality, any optimal solution $\{y_{A^*}\}$ to this linear program has
cost $\sum_{A^*} y_{A^*} = \rho^*(Q)$. We also know that there exists an optimal solution
with rational values. Thus, there exist arbitrarily large N such that $N^{y_{A^*}}$
is an integer for all $A^*$.

For any such N, we can construct a database D as follows. For each
attribute class $A^*$, let $N_A = N^{y_{A^*}}$, and let $[N_A] = \{1, \ldots, N_A\}$ be the domain
for the attributes in $A^*$. For each relation R of Q, let the relation instance
R contain all tuples t for which $t(A) \in [N_A]$ for all attributes A, but $t(A) =
t(B)$ for any attributes A and B equated in Q (i.e. such that $A^* = B^*$). For
each attribute class $A^*$ in $a(R)$ there are $N_A$ possible values of the attributes
in $A^*$, so the size of R will be

$|R| = \prod_{A^* \in a(R)} N_A = \prod_{A^* \in a(R)} N^{y_{A^*}} = N^{\sum_{A^* \in a(R)} y_{A^*}} \le N$.

This implies that $|D| \le |Q| \cdot N$. However, we have $\sum_{A^* \in a(R)} y_{A^*} = 1$ for
at least one relation R (otherwise we could increase any $y_{A^*}$ to produce a
better solution to the linear program), so $|D| \ge N$.

Any tuple t in the result Q(D) is given by its values $t(A_1) = \cdots =
t(A_k) \in [N_{A_1}]$ for each attribute class $A_1^* = \{A_1, \ldots, A_k\}$, and any such
combination of values gives a valid tuple in the output. The size of the
output is thus

$|Q(D)| = \prod_{A^*} N_A = \prod_{A^*} N^{y_{A^*}} = N^{\rho^*(Q)} \ge (|D|/|Q|)^{\rho^*(Q)}$.

Since all tuples in each relation are distinct, all tuples in the output are
distinct, and we also have $\|Q(D)\| = |Q(D)| \ge (|D|/|Q|)^{\rho^*(Q)}$. The outer
projection of Q does not reduce the cardinality of Q's result, since we
consider bag semantics.
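
The construction can be tried out on a small instance. The sketch below is
an illustration only: the function name and the encoding of relations as
tuples of attribute classes are assumptions made here, and the result is
computed by brute force over all assignments rather than by any algorithm
from this report. For the triangle query with $y_{A^*} = 1/2$ for each of the
three classes and N = 16, each domain has $N^{1/2} = 4$ values.

```python
from itertools import product


def worst_case_output(relations, domain_size):
    """Build each relation as the full product of its classes' domains.

    relations: list of tuples of attribute classes, one tuple per relation.
    domain_size: {class: N_A}.
    Returns (size of the largest relation, number of join result tuples).
    """
    classes = sorted(domain_size)
    db = [set(product(*(range(1, domain_size[c] + 1) for c in rel)))
          for rel in relations]
    count = 0
    for values in product(*(range(1, domain_size[c] + 1) for c in classes)):
        assignment = dict(zip(classes, values))
        # The assignment is an output tuple if every relation contains its part.
        if all(tuple(assignment[c] for c in rel) in inst
               for rel, inst in zip(relations, db)):
            count += 1
    return max(len(inst) for inst in db), count
```

With relations of 16 tuples each, the triangle query returns
$16^{3/2} = 64$ tuples.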

Now suppose that Q is repeating, that is, contains multiple relations
mapping to the same name. In that case, such relations require the same
relation instance as their interpretation, while the database D constructed
in the above proof may assign them different relation instances. However,
consider the database D' constructed as follows. For any class $\{R_1, \ldots, R_k\}$
of relations mapping to the same name R, replace the relation instances
$R_1, \ldots, R_k$ in D by a single relation instance $R = \bigcup_i R_i$ in D'.

Firstly, we have $|D'| \le |D|$, since $|\bigcup_i R_i| \le \sum_i |R_i|$. Secondly, we still
have $|D'| \ge N$, since the size of the largest relation in D' is at least the size
of the largest relation in D. Finally, we have $Q(D') \supseteq Q(D)$, because for
any relation symbol $R_i$ of Q, its interpretation R in D' is a superset of its
interpretation $R_i$ in D. Thus we get

$\|Q(D')\| \ge \|Q(D)\| \ge (|D|/|Q|)^{\rho^*(Q)} \ge (|D'|/|Q|)^{\rho^*(Q)}$,

which completes the proof.

Proof of Lemma [6] 

Let $Q = \pi_A(\sigma_\varphi(R_1 \times \cdots \times R_n))$ be a query, T be an f-tree of Q, and $\Phi(T)$ be
the T-factorisation of Q(D). Also let R be a relation in Q. We show that
there exist arbitrarily large databases D such that each identifier r from R
occurs in $\Phi(T)$ at least $(|D|/|Q|)^{\rho^*(Q_R)}$ times.

Recall that the query $Q_R$ is obtained by restricting Q to the attributes
of NR = Non-relevant(R), and omitting the projection $\pi_A$.

Applying Lemma [5] to the query $Q_R$, we obtain that there exist arbitrarily
large databases $D_R$ such that $\|Q_R(D_R)\| \ge (|D_R|/|Q_R|)^{\rho^*(Q_R)}$.
Construct the database D by extending $D_R$: for each new attribute A
allowing a single value 1, and extending each tuple in each relation by this
value in the new attributes. For relations appearing in Q but with no
attributes in $Q_R$, the relation instance in D will consist of a single tuple with
value 1 in each attribute. Notice that $|Q_R| \le |Q|$ and $|D| = |D_R|$, so that

$\|Q_R(D_R)\| \ge (|D|/|Q|)^{\rho^*(Q_R)}$.

Finally, a tuple from $D_R$ satisfies $\varphi_R$ if and only if the corresponding
extended tuple satisfies $\varphi$, since the values in all attributes outside NR are
equal. Moreover, since R has no attributes in NR, each identifier r from R
corresponds to the tuple $\langle t \rangle = \langle 1, \ldots, 1 \rangle$, and each tuple from $(R_1 \times \cdots \times R_n)$
satisfies $\sigma_{S(R)=\langle t \rangle}$. By Lemma [2] the number of occurrences of any r from
R in the T-factorisation of Q(D) is

$\|\pi_{NR}(\sigma_{S(R)=\langle t \rangle} \sigma_\varphi(R_1 \times \cdots \times R_n))\|$
$= \|\pi_{NR}(\sigma_\varphi(R_1 \times \cdots \times R_n))\|$
$= \|\sigma_{\varphi_R}(\pi_{NR}(R_1 \times \cdots \times R_n))\|$
$= \|Q_R(D_R)\|$
$\ge (|D|/|Q|)^{\rho^*(Q_R)}$.
Proof of Corollary [6] 

We show that if Q is hierarchical, the readability of Q is bounded by a
constant, while if Q is non-hierarchical, for any f-tree T of Q there exist
databases D such that the T-factorisation of Q(D) is read-$\Omega(|D|)$.

By Proposition [5], if Q is hierarchical, there exists an f-tree T of Q such
that Non-relevant(R) = $\emptyset$ for all relations R. For any such tree T we have
$f(T) = 0$, hence $f(Q) = 0$, and by Theorem [3] the readability of Q(D) is
$O(1)$.

If Q is non-hierarchical, for any f-tree T there is a relation R such that
Non-relevant(R) is nonempty. Then the query $Q_R$ contains at least one
attribute, and hence $\rho^*(Q_R) \ge 1$. Therefore $f(T) \ge 1$ and also $f(Q) \ge 1$.
The result then follows from Theorem

Proof of Theorem [3]

We show that for a fixed non-repeating query Q, the following holds. If
Q is hierarchical, the readability of Q(D) is 1 for any database D. If Q
is non-hierarchical, there exist arbitrarily large databases D such that the
readability of Q(D) is $\Omega(\sqrt{|D|})$.
In case Q is hierarchical, then by Proposition [5] there exists an f-tree
such that Non-relevant(R) = $\emptyset$ for any relation R of Q. By Lemma [2] it
follows that hierarchical queries admit f-representations with readability 1.

If Q is not hierarchical, there exist attribute classes $A^*$ and $B^*$ such that
$r(A) \not\subseteq r(B)$, $r(B) \not\subseteq r(A)$ and $r(A) \cap r(B) \ne \emptyset$. Thus there must exist
a relation S with attributes from $A^*$ and $B^*$, a relation R with attributes
from $A^*$ but not $B^*$, and a relation T with attributes from $B^*$ but not $A^*$.

Fix any positive integer N. Consider a database instance D in which
the domains of attributes in $A^*$ and $B^*$ are $\{1, \ldots, N\}$ and the domains of
all other attributes are $\{1\}$. For each relation R, let its interpretation R
be the set of all possible tuples with the above domains, which respect the
equivalence classes of attributes. We annotate the tuple in R with $A^*$-value
i by $r_i$, the tuple in T with $B^*$-value j by $t_j$, and the tuple in S with $A^*$-value
i and $B^*$-value j by $s_{ij}$. All relations contain $N^2$, N, or only one tuple,
depending on whether they contain attributes from both $A^*$ and $B^*$, from
exactly one of them, or from neither. Thus, $|D| = \Theta(N^2)$.

The polynomial of the flat f-representation of Q(D), restricted to the
identifiers from R, S and T, is $\sum_{i,j=1}^{N} r_i s_{ij} t_j$, which is exactly the polynomial
$p_N$ defined in Lemma [1]. By Lemma [1] this polynomial has readability $\Omega(N)$.
Since any f-representation of Q(D) restricted to the identifiers of R, S and
T is equivalent to $p_N$, Q(D) also has readability $\Omega(N) = \Omega(\sqrt{|D|})$.

Proofs from Section [8] 
Proof of Lemma [7] 

For any f-tree T and relation R labelling a leaf of T, denote by $\text{Path}_T(R)$ the
set of ancestor nodes of R in T (thus emphasising the role of the tree T in our
previous notation Path(R)), and similarly for $\text{Non-relevant}_T(R)$. We show
that for any two f-trees $T_1$ and $T_2$ for a query Q, if $\text{Path}_{T_1}(R) \subseteq \text{Path}_{T_2}(R)$
for any relation R of Q, then $f(T_1) \le f(T_2)$.

For any relation R of Q, if $\text{Path}_{T_1}(R) \subseteq \text{Path}_{T_2}(R)$, then also
$\text{Non-relevant}_{T_1}(R) \subseteq \text{Non-relevant}_{T_2}(R)$. If we let $Q_R^{T_1}$ be the query induced
by $\text{Non-relevant}_{T_1}(R)$, and $Q_R^{T_2}$ the query induced by $\text{Non-relevant}_{T_2}(R)$,
$Q_R^{T_1}$ is an induced subquery of $Q_R^{T_2}$, i.e. the hypergraph of $Q_R^{T_1}$ is an induced
subhypergraph of $Q_R^{T_2}$.

If we denote by $L_1$ the fractional-cover linear program for $Q_R^{T_1}$, as defined
in Definition [7], and by $L_2$ the fractional-cover linear program for $Q_R^{T_2}$, then
the variables of $L_1$ are just a subset of variables of $L_2$, and the linear
conditions of $L_1$ are respective restrictions of the conditions of $L_2$. Thus, any
optimal solution of $L_2$ can be restricted to a feasible solution of $L_1$. The cost
of such a restricted solution in $L_1$ is always at most the cost of the original
solution in $L_2$, which implies that $\rho^*(Q_R^{T_1}) \le \rho^*(Q_R^{T_2})$. By maximising over
R, we obtain $f(T_1) \le f(T_2)$.



Proof of Lemma [8]

Let T be an f-tree. For two nodes $A^*$ and $B^*$, we show that if $r(B) \subseteq r(A)$
and $B^*$ is an ancestor of $A^*$, then by swapping them we do not violate the
condition C and do not increase $f(T)$.

For any relation $R \notin r(A)$, the positions of nodes from Relevant(R)
remain unchanged. For any relation $R \in r(A)$, the leaf labelled by R is
under $A^*$ and hence by swapping $A^*$ and $B^*$, all nodes relevant to R stay on
the path from R to the root. Therefore, the condition C remains satisfied.

It remains to prove that by this swap, the parameter $f(T)$ does not
increase. The only relations R for which the set Path(R) changes (and thus
$\rho^*(Q_R)$ can change), are those lying in the subtree under $B^*$ but not in the
subtree under $A^*$. For such R, we replace the node $B^*$ in Path(R) by the
node $A^*$. Consider the fractional-cover linear program for $Q_R$, defined as

minimise $\sum_i x_i$
subject to $\sum_{i : R_i \in r(A)} x_i \ge 1$ for all attributes A, and
$x_i \ge 0$ for all i,

in Definition [7]. By replacing $B^*$ with $A^*$, the only change to this program
is the weakening of the condition $\sum_{i : R_i \in r(B)} x_i \ge 1$ to $\sum_{i : R_i \in r(A)} x_i \ge 1$
(since $r(B) \subseteq r(A)$, every solution satisfying the former also satisfies the latter).
Therefore, the cost $\rho^*(Q_R)$ of the optimal solution can only decrease. By
maximising over all relations R of Q, we conclude that $f(T)$ can also only
decrease.

Proof of Proposition [6] 

We show that for a hierarchical query Q, the algorithm iter-pruned has 
exactly one choice at each recursive call, and outputs a single reduced f-tree 
in polynomial time. 

The standard algorithm for recognising hierarchical queries (described
in [DS07b], though in the language of conjunctive queries) is as follows.

• Find the connected components of the query, in the sense that two 
relation symbols are connected if some of their attributes are equated 
by the query. 

• For each connected component, there must exist an attribute class 
with attributes in each relation in the component. If not, the query is 
not hierarchical. Create a node labelled by this attribute class, make 
it the root of an f-tree, and recurse on the rest of the component to 
produce its children subtrees. 

• Output the disjoint union of the trees produced for each component. 
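
The recognition steps above can be sketched as a short recursive test. This
is an illustration under assumptions made here, not the procedure of
[DS07b] or iter-pruned itself: each relation is encoded as the set of
attribute classes it has attributes in, the function name is hypothetical,
and only the yes/no answer is returned rather than the f-tree.

```python
def is_hierarchical(relations):
    """relations: list of sets of attribute classes, one set per relation."""
    relations = [set(r) for r in relations if r]
    if not relations:
        return True
    # Connected components: relations sharing an attribute class are connected.
    components = []
    for rel in relations:
        merged, rest = [rel], []
        for comp in components:
            if any(rel & other for other in comp):
                merged.extend(comp)
            else:
                rest.append(comp)
        components = rest + [merged]
    for comp in components:
        common = set.intersection(*comp)   # classes present in every relation
        if not common:
            return False
        root = min(common)                 # any common class can serve as root
        # Recurse on the rest of the component with the root class removed.
        if not is_hierarchical([r - {root} for r in comp]):
            return False
    return True
```

For example, relations over {A, B} and {A, C} pass the test, while the chain
{A}, {A, B}, {B} fails because no class occurs in all three relations.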






The connected components of the query correspond to the finest partition
$P_1, \ldots, P_n$ of the attribute classes such that each relation only has attributes
from one $P_j$. If the considered query is hierarchical, for each such $P_j$ there
exists an attribute class with attributes in each relation of $P_j$. That is, there
exists at least one $A^* \in P_j$ such that for other classes $B^* \in P_j$, $r(A) \supseteq r(B)$.
The lexicographically greatest such $A^*$ will be the maximum element in the
>-order. The algorithm iter-pruned will therefore only consider this $A^*$
for the root of the subtree formed by $P_j$.

We have thus shown that for hierarchical queries, iter-pruned essentially
follows the recognising algorithm given above, never branching when
picking the root node, and hence outputting a single reduced f-tree. This
also means that there are at most linearly many recursive calls of
iter-pruned. Since each call takes polynomial time, the total running time is
also polynomial (in the size of the query).

Proofs from Section [9] 
Proof of Lemma [9] 

We show that the total amount of time taken by line (1) of gen2 when
computing the T-factorisation of Q(D) is $O(|Q| \cdot |D|^{f(T)+1})$.

Let $A^*$ be any node in T, let $\mathcal{U}$ be the subtree of T rooted at $A^*$ and let
Path(A) be the set of ancestor nodes of $A^*$. Consider any call gen2($\mathcal{U}$, $\mathcal{R}$),
where $\mathcal{R}$ is a collection of ranges of tuples in D. For each such call, the tuples
in $\mathcal{R}$ agree on the values of attributes from Path(A); moreover, the ranges
$\mathcal{R}$ contain all tuples of D with these values. Denote by $\gamma$ the condition on
the attributes from Path(A) with the values given by tuples in $\mathcal{R}$. For each
call gen2($\mathcal{U}$, $\mathcal{R}$), the ranges $\mathcal{R}$ are different and hence this condition $\gamma$ is
different. Conversely, for any $\gamma$ such that the corresponding ranges in the
relations of D are all nonempty, gen2 will be called with these ranges in the
second parameter and $\mathcal{U}$ in the first parameter.

We will now calculate the total amount of time taken by line (1) in all
calls of gen2($\mathcal{U}$, $\mathcal{R}$) for a fixed $\mathcal{U}$, rooted at $A^*$. We have argued before the
statement of the Lemma that the amount of time taken by line (1) in any
single call gen2($\mathcal{U}$, $\mathcal{R}$) is linear in the number of tuples in the ranges $\mathcal{R}$.
Instead of summing the number of tuples in $\mathcal{R}$ for each such call, we will
fix a tuple $\langle t \rangle$ and find the number of calls for which $\mathcal{R}$ contains this tuple.
Equivalently, we will find the number of the corresponding conditions $\gamma$
satisfied by $\langle t \rangle$.

For a condition $\gamma$, the ranges $\mathcal{R}$ corresponding to $\gamma$ in D are nonempty iff
$(\sigma_\gamma(R_1 \times \cdots \times R_n))(D)$ is nonempty. Furthermore, the corresponding ranges
are nonempty and $\gamma$ is satisfied by $\langle t \rangle$ iff $(\sigma_{S(R)=\langle t \rangle}(\sigma_\gamma(R_1 \times \cdots \times R_n)))(D)$
is nonempty. Equivalently, this is true iff
$(\pi_{\text{Path}(A)}(\sigma_{S(R)=\langle t \rangle}(\sigma_\gamma(R_1 \times \cdots \times R_n))))(D)$ is nonempty, but moreover,
in such case that set contains precisely one element, which uniquely corresponds
to the condition $\gamma$. Therefore, the total number of conditions $\gamma$ on the
attributes of Path(A), for which the corresponding ranges are nonempty, and
which are satisfied by $\langle t \rangle$, is

$\sum_\gamma \|(\pi_{\text{Path}(A)}(\sigma_{S(R)=\langle t \rangle}(\sigma_\gamma(R_1 \times \cdots \times R_n))))(D)\|$
$= \|\bigcup_\gamma (\pi_{\text{Path}(A)}(\sigma_{S(R)=\langle t \rangle}(\sigma_\gamma(R_1 \times \cdots \times R_n))))(D)\|$
$= \|(\pi_{\text{Path}(A)}(\sigma_{S(R)=\langle t \rangle}(\sigma_\alpha(R_1 \times \cdots \times R_n))))(D)\|$,

where $\sum$ and $\bigcup$ range over all possible conditions $\gamma$ assigning values from
D to attribute classes of Path(A), and $\alpha$ expresses the equality of attributes
in each attribute class of Path(A), without assigning them particular values.
However, if we let NA = Path(A) \ Relevant(R), we get

$\|(\pi_{\text{Path}(A)}(\sigma_{S(R)=\langle t \rangle}(\sigma_\alpha(R_1 \times \cdots \times R_n))))(D)\|$
$= \|(\pi_{NA}(\sigma_{S(R)=\langle t \rangle}(\sigma_\alpha(R_1 \times \cdots \times R_n))))(D)\|$
$\le \|(\pi_{NA}(\sigma_\alpha(R_1 \times \cdots \times R_n)))(D)\|$
$\le \|(\sigma_{\varphi_{NA}}(\pi_{NA}(R_1 \times \cdots \times R_n)))(D)\|$
$= \|Q_{NA}(D_{NA})\|$,

where $Q_{NA}$ and $D_{NA}$ are defined analogously to $Q_R$ and $D_R$. By Lemma HI
this number is at most $|D_{NA}|^{\rho^*(Q_{NA})} = |D|^{\rho^*(Q_{NA})}$. However, since $Q_{NA}$ is
an induced subquery of $Q_R$, we have $\rho^*(Q_{NA}) \le \rho^*(Q_R)$, which is in turn at
most $f(T)$. We can thus conclude that for a fixed tuple $\langle t \rangle$ from a relation
$R \in r(A)$, the total number of conditions $\gamma$ on the attributes of Path(A),
for which the corresponding ranges are nonempty, and which are satisfied
by $\langle t \rangle$, is at most $|D|^{f(T)}$.

There are at most $|D|$ tuples in the relations of $r(A)$, so the total amount 
of time taken by line (1) in all calls of gen2($\mathcal{U}$, $\mathcal{R}$), for $\mathcal{U}$ rooted at a fixed 
node $\mathcal{A}$, is linear in $|D|^{f(\mathcal{T})+1}$. Since there are at most $|Q|$ different nodes 
$\mathcal{A}$, the total time taken by line (1) is linear in $|Q| \cdot |D|^{f(\mathcal{T})+1}$. 



Proofs from Section 10 



Proof of Corollary [7] 

We show that for any query $Q$ with constants and any database $D$, the 
readability of $Q(D)$ is at most $M \cdot |D|^{f(Q)}$. 

Recall that $M$ is the maximal number of relations of $Q$ mapping to the 
same name, and is the same for $Q$ as for $Q'$. By Corollary [3] the readability 
of $Q(D) = Q'(\sigma_{\psi_C}(D))$ is at most $M \cdot |\sigma_{\psi_C}(D)|^{f(Q')}$. Since $f(Q') = f(Q) \ge 0$ 
and $|\sigma_{\psi_C}(D)| \le |D|$, this is at most $M \cdot |D|^{f(Q')} = M \cdot |D|^{f(Q)}$. 

Proof of Corollary [8] 

We show that for any query $Q$ with constants and any f-tree $\mathcal{T}$ of $Q$, there 
exist arbitrarily large databases $D$ for which the $\mathcal{T}$-factorisation of $Q(D)$ is at 
least read-$(|D|/|Q|)^{f(Q)}$. 

The attributes in $C$ do not appear in any equalities in $Q'$, so each 
attribute is only relevant to one relation. In any f-tree of $Q'$, we can move 
these attributes downwards towards their respective relations, thus only 
decreasing the non-relevant sets for other relations, and hence not increasing 
$f(\mathcal{T})$. It follows that there exists an f-tree $\mathcal{T}$ with $f(\mathcal{T}) = f(Q')$, such that 
for any relation $R$, Non-relevant($R$) does not contain any attributes from $C$. 

Now by Corollary [5] there exist arbitrarily large databases $D$ for which 
the $\mathcal{T}$-factorisation of $Q'(D)$ is at least read-$(|D|/|Q'|)^{f(Q')}$. Moreover, from 
the proof of Lemma [6] it follows that $D$ can be constructed in such a way 
that for some relation $R$, all attributes not in Non-relevant($R$) have domain 
of size one. In particular, all attributes from $C$ have domain of size one. 
By renaming the values of these attributes to the respective constants from 
$\psi_C$ we can arrange that $\sigma_{\psi_C}(D) = D$. Since $|Q| = |Q'|$ and $f(Q) = f(Q')$, 
it follows that the $\mathcal{T}$-factorisation of $Q(D) = Q'(\sigma_{\psi_C}(D))$ is at least 
read-$(|D|/|Q|)^{f(Q)}$. 

Proof of Corollary [9] 

Let $Q$ be a non-repeating query with constants. We show that if $Q'$ is 
hierarchical, the readability of $Q(D)$ is 1 for any database $D$, and if $Q'$ 
is non-hierarchical, there exist arbitrarily large databases $D$ such that the 
readability of $Q(D)$ is $\Omega(\sqrt{|D|})$. 

For hierarchical queries we have $f(Q) = 0$ and the result follows from 
Corollary [7]. For non-hierarchical queries, by Theorem U] there exist 
arbitrarily large databases $D$ such that the readability of $Q'(D)$ is $\Omega(\sqrt{|D|})$. 
Moreover, from the proof of Theorem [5] it follows that apart from two attribute 
classes $\mathcal{A}$ and $\mathcal{B}$ such that $r(A) \nsubseteq r(B)$, $r(B) \nsubseteq r(A)$ and $r(A) \cap r(B) \neq \emptyset$, 
we can arrange that all attributes have domains of size one. We cannot have 
$A \in C$ or $B \in C$, since each attribute in $C$ is only relevant to one relation, so 
we can in fact arrange that all attributes from $C$ have domains of size one. 
Again by simple renaming of values, we obtain $D = \sigma_{\psi_C}(D)$, and hence the 
readability of $Q(D) = Q'(\sigma_{\psi_C}(D))$ is $\Omega(\sqrt{|D|})$. 
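
The condition on the two attribute classes used above can be checked mechanically. The following is a minimal Python sketch, not taken from the paper: it assumes a query is given as a dictionary mapping each attribute class name to the set of relation names $r(A)$ in which it occurs, and the function names and example queries are hypothetical.

```python
from itertools import combinations


def non_hierarchical_witness(r):
    """Return a pair of attribute classes (a, b) such that r[a] and r[b]
    intersect but neither contains the other, or None if no pair exists."""
    for a, b in combinations(sorted(r), 2):
        ra, rb = set(r[a]), set(r[b])
        if ra & rb and not ra <= rb and not rb <= ra:
            return (a, b)
    return None


def is_hierarchical(r):
    """A query is treated as hierarchical iff it has no witness pair."""
    return non_hierarchical_witness(r) is None


# Hypothetical query R(A), S(A, B): r(A) = {R, S} contains r(B) = {S}.
nested = {"A": {"R", "S"}, "B": {"S"}}

# Hypothetical query R(A), S(A, B), T(B): r(A) and r(B) overlap in S,
# but neither set contains the other.
chain = {"A": {"R", "S"}, "B": {"S", "T"}}

print(is_hierarchical(nested), non_hierarchical_witness(chain))
```

In the second example the returned pair plays the role of the two classes whose domains are kept large in the lower-bound construction, while all other attributes can be given domains of size one.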